Abstract. In this paper we consider iterative methods for stochastic variational inequalities (s.v.i.) with monotone operators. Our basic assumption is that the operator possesses both smooth and nonsmooth components. Further, only noisy observations of the problem data are available. We develop a novel Stochastic Mirror-Prox (SMP) algorithm for solving s.v.i. and show that with a suitable stepsize strategy it attains the optimal rates of convergence with respect to the problem parameters. We apply the SMP algorithm to Stochastic composite minimization and describe particular applications to the Stochastic Semidefinite Feasibility problem and Eigenvalue minimization.
1. Introduction. Let $Z$ be a convex compact set in a Euclidean space $\mathcal{E}$ with inner product $\langle\cdot,\cdot\rangle$, let $\|\cdot\|$ be a norm on $\mathcal{E}$ (not necessarily the one associated with the inner product), and let $F: Z \to \mathcal{E}$ be a monotone mapping:

(1.1) $\forall (z, z' \in Z):\ \langle F(z) - F(z'), z - z'\rangle \ge 0.$

We are interested in approximating a solution to the variational inequality (v.i.)

(1.2) find $z_* \in Z$: $\langle F(z), z_* - z\rangle \le 0\ \ \forall z \in Z$

associated with $Z$, $F$. Note that since $F$ is monotone on $Z$, the condition in (1.2) is implied by $\langle F(z_*), z - z_*\rangle \ge 0$ for all $z \in Z$, which is the standard definition of a (strong) solution to the v.i. associated with $Z$, $F$. The converse (a solution to the v.i. as defined by (1.2), a "weak" solution, is a strong solution as well) is also true, provided, e.g., that $F$ is continuous. An advantage of the concept of weak solution is that such a solution always exists under our assumptions ($F$ is well defined and monotone on a convex compact set $Z$).

We quantify the inaccuracy of a candidate solution $z \in Z$ by the error

(1.3) $\mathrm{Err}_{\mathrm{vi}}(z) := \max_{u \in Z} \langle F(u), z - u\rangle;$

note that this error is always $\ge 0$ and equals zero iff $z$ is a solution to (1.2).

In what follows we impose on $F$, aside from monotonicity, the requirement

(1.4) $\forall (z, z' \in Z):\ \|F(z) - F(z')\|_* \le L\|z - z'\| + M$

with some known constants $L \ge 0$, $M \ge 0$. From now on,

(1.5) $\|\xi\|_* = \max_{z:\, \|z\| \le 1} \langle \xi, z\rangle$

is the norm conjugate to $\|\cdot\|$.

We are interested in the case where (1.2) is solved by an iterative algorithm based on a stochastic oracle representation of the operator $F(\cdot)$. Specifically, when solving the problem, the algorithm acquires information on $F$ via subsequent calls to a black box ("stochastic oracle", SO). At the $i$-th call, $i = 0, 1, \ldots$, the oracle gets as input a search point $z_i \in Z$ (this point is generated by the algorithm on the basis of the information accumulated so far) and returns the vector $\Xi(z_i, \zeta_i)$, where $\{\zeta_i \in \mathbf{R}^N\}_{i=1}^\infty$ is a sequence of i.i.d. random variables (independent of the queries of the algorithm). We suppose that the Borel function $\Xi(z, \zeta)$ is such that

(1.6) $\forall z \in Z:\ \mathbf{E}\{\Xi(z, \zeta_i)\} = F(z),\quad \mathbf{E}\{\|\Xi(z, \zeta_i) - F(z)\|_*^2\} \le N^2.$

We call a monotone v.i. (1.2), augmented by a stochastic oracle (SO), a stochastic monotone v.i. (s.v.i.).

To motivate our goal, let us start with known results [5] on the limits of performance of iterative algorithms for solving large-scale stochastic v.i.'s. To "normalize" the situation, assume that $Z$ is the unit Euclidean ball in $\mathcal{E} = \mathbf{R}^n$ and that $n$ is large. In this case, the rate of convergence of any algorithm for solving v.i.'s cannot be better than $O(1)\left[\frac{L}{t} + \frac{M+N}{\sqrt{t}}\right]$. In other words, for a properly chosen positive absolute constant $C$, for every number of steps $t$, all large enough values of $n$ and any algorithm $\mathcal{B}$ for solving s.v.i.'s on the unit ball of $\mathbf{R}^n$, one can point out a monotone s.v.i. satisfying (1.4), (1.6) and such that the expected error of the approximate solution $z_t$ generated by $\mathcal{B}$ after $t$ steps, applied to such s.v.i., is at least $c\left[\frac{L}{t} + \frac{M+N}{\sqrt{t}}\right]$ for some $c > 0$. To the best of our knowledge, no existing algorithm achieves this convergence rate uniformly in the dimension. In fact, the "best approximations" available are given by Robust Stochastic Approximation (see [3] and references therein), with the guaranteed rate of convergence $O(1)\frac{L+M+N}{\sqrt{t}}$, and by extra-gradient-type algorithms for solving deterministic monotone v.i.'s with Lipschitz continuous operators (see [6], [?], [10], [11]), which attain the accuracy $O(1)\frac{L}{t}$ in the case of $M = N = 0$, or $O(1)\frac{M}{\sqrt{t}}$ when $L = N = 0$.

The goal of this paper is to demonstrate that a specific Mirror-Prox algorithm [3] for solving monotone v.i.'s with Lipschitz continuous operators can be extended to monotone s.v.i.'s to yield, uniformly in the dimension, the optimal rate of convergence $O(1)\left[\frac{L}{t} + \frac{M+N}{\sqrt{t}}\right]$. We present the corresponding extension and investigate it in detail: we show how the algorithm can be "tuned" to the geometry of the s.v.i. in question, derive bounds for the probability of large deviations of the resulting error, etc. We also present a number of applications where the specific structure of the rate of convergence indeed "makes a difference".

The main body of the paper is organized as follows: in Section 2 we describe several special cases of monotone v.i.'s we are especially interested in (convex Nash equilibria, convex-concave saddle point problems, convex minimization). We single out these special cases since here one can define a useful "functional" counterpart $\mathrm{Err}_{\mathrm{N}}(\cdot)$ of the just defined error $\mathrm{Err}_{\mathrm{vi}}(\cdot)$; both $\mathrm{Err}_{\mathrm{N}}$ and $\mathrm{Err}_{\mathrm{vi}}$ will participate in our subsequent efficiency estimates. Our main development, the Stochastic Mirror Prox (SMP) algorithm, is presented in Section 3. Some general results about the performance of the SMP are presented in Section 3.2. Then in Section 4 we present SMP for Stochastic composite minimization and discuss its applications to the Stochastic Semidefinite Feasibility problem and Eigenvalue minimization. All technical proofs are collected in the appendix.

Notations. In the sequel, lowercase Latin letters denote vectors (and sometimes matrices). Script capital letters, like $\mathcal{E}$, $\mathcal{Y}$, denote Euclidean spaces; the inner product in such a space, say, $\mathcal{E}$, is denoted by $\langle\cdot,\cdot\rangle_{\mathcal{E}}$ (or merely $\langle\cdot,\cdot\rangle$, when the corresponding space is clear from the context). Linear mappings from one Euclidean space to another, say, from $\mathcal{E}$ to $\mathcal{F}$, are denoted by boldface capitals like $\mathbf{A}$ (there are also some reserved boldface capitals, like $\mathbf{E}$ for expectation, $\mathbf{R}^k$ for the $k$-dimensional coordinate space, and $\mathbf{S}^k$ for the space of $k \times k$ symmetric matrices). $\mathbf{A}^*$ stands for the conjugate to the mapping $\mathbf{A}$: if $\mathbf{A}: \mathcal{E} \to \mathcal{F}$, then $\mathbf{A}^*: \mathcal{F} \to \mathcal{E}$ is given by the identity $\langle f, \mathbf{A}e\rangle_{\mathcal{F}} = \langle \mathbf{A}^* f, e\rangle_{\mathcal{E}}$ for $f \in \mathcal{F}$, $e \in \mathcal{E}$. When both the origin and the destination space of a linear map, like $\mathbf{A}$, are the standard coordinate spaces, the map is identified with its matrix $A$, and $\mathbf{A}^*$ is identified with $A^T$. For a norm $\|\cdot\|$ on $\mathcal{E}$, $\|\cdot\|_*$ stands for the conjugate norm, see (1.5).

For Euclidean spaces $\mathcal{E}_1, \ldots, \mathcal{E}_m$, $\mathcal{E} = \mathcal{E}_1 \times \cdots \times \mathcal{E}_m$ denotes their Euclidean direct product, so that a vector from $\mathcal{E}$ is a collection $u = [u_1; \ldots; u_m]$ ("MATLAB notation") of vectors $u_\ell \in \mathcal{E}_\ell$, and $\langle u, v\rangle_{\mathcal{E}} = \sum_\ell \langle u_\ell, v_\ell\rangle_{\mathcal{E}_\ell}$. Sometimes we allow ourselves to write $(u_1, \ldots, u_m)$ instead of $[u_1; \ldots; u_m]$.

2. Preliminaries. 

2.1. Nash v.i.'s and functional error. In the sequel, we shall be especially interested in a special case of v.i. (1.2): a Nash v.i. coming from a convex Nash Equilibrium problem, and in the associated functional error measure. The Nash Equilibrium problem can be described as follows: there are $m$ players, the $i$-th of them choosing a point $z_i$ from a given set $Z_i$. The loss of the $i$-th player is a given function $\phi_i(z)$ of the collection $z = (z_1, \ldots, z_m) \in Z = Z_1 \times \cdots \times Z_m$ of the players' choices. With slight abuse of notation, we also write $\phi_i(z_i, z^i)$ for $\phi_i(z)$, where $z^i$ is the collection of choices of all but the $i$-th player. Players are interested in minimizing their losses, and a Nash equilibrium $\hat z$ is a point from $Z$ such that for every $i$ the function $\phi_i(z_i, \hat z^i)$ attains its minimum in $z_i \in Z_i$ at $z_i = \hat z_i$ (so that in the state $\hat z$ no player has an incentive to change his choice, provided that the other players stick to their choices).

We call a Nash equilibrium problem convex, if for every $i$ $Z_i$ is a compact convex set, $\phi_i(z_i, z^i)$ is a Lipschitz continuous function convex in $z_i$ and concave in $z^i$, and the function $\Phi(z) = \sum_{i=1}^m \phi_i(z)$ is convex. It is well known (see, e.g., [S]) that setting

$F(z) = [F^1(z); \ldots; F^m(z)],\quad F^i(z) \in \partial_{z_i}\phi_i(z_i, z^i),\ i = 1, \ldots, m,$

where $\partial_{z_i}\phi_i(z_i, z^i)$ is the subdifferential of the convex function $\phi_i(\cdot, z^i)$ at a point $z_i$, we get a monotone operator such that the solutions to the corresponding v.i. (1.2) are exactly the Nash equilibria. Note that since $\phi_i$ are Lipschitz continuous, the associated operator $F$ can be chosen to be bounded. For this v.i. one can consider, along with the v.i.-accuracy measure $\mathrm{Err}_{\mathrm{vi}}(z)$, the functional error measure

$\mathrm{Err}_{\mathrm{N}}(z) = \sum_{i=1}^m \left[\phi_i(z) - \min_{w_i \in Z_i} \phi_i(w_i, z^i)\right].$

This accuracy measure admits a transparent justification: it is the sum, over the players, of the incentives for a player to change his choice given that the other players stick to their choices.

Special cases: saddle points and minimization. An important particular case of the Nash Equilibrium problem is an antagonistic 2-person game, where $m = 2$ and $\Phi(z) \equiv 0$ (i.e., $\phi_2(z) \equiv -\phi_1(z)$). The convex case of this problem corresponds to the situation when $\phi(z_1, z_2) \equiv \phi_1(z_1, z_2)$ is a Lipschitz continuous function which is convex in $z_1 \in Z_1$ and concave in $z_2 \in Z_2$; the Nash equilibria are exactly the saddle points (min in $z_1$, max in $z_2$) of $\phi$ on $Z_1 \times Z_2$, and the functional error becomes

$\mathrm{Err}_{\mathrm{N}}(z_1, z_2) = \max_{(u_1, u_2) \in Z} \left[\phi(z_1, u_2) - \phi(u_1, z_2)\right].$

Recall that the convex-concave saddle point problem $\min_{z_1 \in Z_1} \max_{z_2 \in Z_2} \phi(z_1, z_2)$ gives rise to the "primal-dual" pair of convex optimization problems

$(P):\ \min_{z_1 \in Z_1} \overline{\phi}(z_1),\qquad (D):\ \max_{z_2 \in Z_2} \underline{\phi}(z_2),$

where

$\overline{\phi}(z_1) = \max_{z_2 \in Z_2} \phi(z_1, z_2),\qquad \underline{\phi}(z_2) = \min_{z_1 \in Z_1} \phi(z_1, z_2).$

The optimal values $\mathrm{Opt}(P)$ and $\mathrm{Opt}(D)$ in these problems are equal, the set of saddle points of $\phi$ (i.e., the set of Nash equilibria of the underlying convex Nash problem) is exactly the direct product of the optimal sets of $(P)$ and $(D)$, and $\mathrm{Err}_{\mathrm{N}}(z_1, z_2)$ is nothing but the sum of non-optimalities of $z_1$, $z_2$ considered as approximate solutions to the respective optimization problems:

$\mathrm{Err}_{\mathrm{N}}(z_1, z_2) = \left[\overline{\phi}(z_1) - \mathrm{Opt}(P)\right] + \left[\mathrm{Opt}(D) - \underline{\phi}(z_2)\right].$
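As an illustration of the saddle point error measure, the sketch below evaluates the primal-dual gap $\max_{u_2}\phi(z_1,u_2) - \min_{u_1}\phi(u_1,z_2)$ for a bilinear game $\phi(z_1, z_2) = z_1^T A z_2$ on a product of probability simplices. The bilinear objective, the choice of simplices, and the function name `primal_dual_gap` are assumptions made for this example only; they do not come from the paper.

```python
def primal_dual_gap(A, x, y):
    """Err_N(x, y) = max_v x^T A v - min_u u^T A y over probability simplices."""
    rows, cols = len(A), len(A[0])
    # Upper function: a linear form over the simplex is maximised at a vertex,
    # so max_v x^T A v is the largest entry of A^T x.
    upper = max(sum(x[i] * A[i][j] for i in range(rows)) for j in range(cols))
    # Lower function: min_u u^T A y is the smallest entry of A y.
    lower = min(sum(A[i][j] * y[j] for j in range(cols)) for i in range(rows))
    return upper - lower


A = [[1.0, -1.0], [-1.0, 1.0]]  # "matching pennies" payoff matrix
gap_at_saddle = primal_dual_gap(A, [0.5, 0.5], [0.5, 0.5])
gap_at_vertex = primal_dual_gap(A, [1.0, 0.0], [1.0, 0.0])
```

The gap vanishes at the uniform strategies, which form the saddle point of this game, and is strictly positive at a pair of pure strategies.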

Finally, the "trivial" case $m = 1$ of the convex Nash Equilibrium problem is the problem of minimizing a Lipschitz continuous convex function $\phi(z) = \phi_1(z_1)$ over the convex compact set $Z = Z_1$. In this case, the functional error becomes the usual residual in terms of the objective:

$\mathrm{Err}_{\mathrm{N}}(z) = \phi(z) - \min_Z \phi.$

In the sequel, we refer to the v.i. (1.2) coming from a convex Nash Equilibrium problem as a Nash v.i., and to the two just outlined particular cases of the Nash v.i. as the Saddle Point and the Minimization v.i., respectively. It is easy to verify that in the Saddle Point/Minimization case the functional error $\mathrm{Err}_{\mathrm{N}}(z)$ is $\ge \mathrm{Err}_{\mathrm{vi}}(z)$; this is not necessarily so for a general Nash v.i.
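For the Minimization case, the comparison $\mathrm{Err}_{\mathrm{N}}(z) \ge \mathrm{Err}_{\mathrm{vi}}(z)$ can be checked in one line from the subgradient inequality. The derivation below is a sketch under the assumption that $F(u) \in \partial\phi(u)$ for every $u \in Z$, as in the construction of the Nash operator above:

```latex
\langle F(u), z-u\rangle \le \phi(z)-\phi(u)\quad \forall u,z\in Z
\;\Longrightarrow\;
\mathrm{Err}_{\mathrm{vi}}(z)=\max_{u\in Z}\langle F(u), z-u\rangle
\le \max_{u\in Z}\,[\phi(z)-\phi(u)]
=\phi(z)-\min_{Z}\phi
=\mathrm{Err}_{\mathrm{N}}(z).
```

The Saddle Point case follows the same pattern, applying the subgradient inequality in $z_1$ and the supergradient inequality in $z_2$.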

2.2. Prox-mapping. We once and for all fix a norm $\|\cdot\|$ on $\mathcal{E}$; $\|\cdot\|_*$ stands for the conjugate norm, see (1.5). A distance-generating function for $Z$ is, by definition, a continuous convex function $\omega(\cdot): Z \to \mathbf{R}$ such that

1. if $Z^o$ is the set of all points $z \in Z$ such that the subdifferential $\partial\omega(z)$ of $\omega(\cdot)$ at $z$ is nonempty, then the subdifferential of $\omega$ admits a continuous selection on $Z^o$: there exists a continuous on $Z^o$ vector-valued function $\omega'(z)$ such that $\omega'(z) \in \partial\omega(z)$ for all $z \in Z^o$;

2. for certain $\alpha > 0$, $\omega(\cdot)$ is strongly convex, modulus $\alpha$, w.r.t. the norm $\|\cdot\|$:

(2.1) $\forall (z, z' \in Z^o):\ \langle \omega'(z) - \omega'(z'), z - z'\rangle \ge \alpha\|z - z'\|^2.$

In the sequel, we fix a distance-generating function $\omega(\cdot)$ for $Z$ and assume that $\omega(\cdot)$ and $Z$ "fit" each other, meaning that one can easily solve problems of the form

(2.2) $\min_{z \in Z} \left[\omega(z) + \langle e, z\rangle\right],\quad e \in \mathcal{E}.$

The prox-function associated with the distance-generating function $\omega$ is defined as

$V(z, u) = \omega(u) - \omega(z) - \langle \omega'(z), u - z\rangle:\ Z^o \times Z \to \mathbf{R}_+.$

We set

(2.3) (a) $\Theta(z) = \max_{u \in Z} V(z, u)$ $[z \in Z^o]$; (b) $z_c = \mathrm{argmin}_Z\, \omega(z)$; (c) $\Theta = \Theta(z_c)$; (d) $\Omega = \sqrt{2\Theta/\alpha}$.

Note that $z_c$ is well defined (since $Z$ is a convex compact set and $\omega(\cdot)$ is continuous and strongly convex on $Z$) and belongs to $Z^o$ (since $0 \in \partial\omega(z_c)$). Note also that due to the strong convexity of $\omega$ and the origin of $z_c$ we have

(2.4) $\forall (u \in Z):\ \frac{\alpha}{2}\|u - z_c\|^2 \le \Theta \le \max_{z \in Z}\omega(z) - \omega(z_c);$

in particular we see that

(2.5) $Z \subset \{z:\ \|z - z_c\| \le \Omega\}.$

Prox-mapping. Given $z \in Z^o$, we associate with this point and $\omega(\cdot)$ the prox-mapping

$P(z, \xi) = \mathrm{argmin}_{u \in Z}\{\omega(u) + \langle \xi - \omega'(z), u\rangle\} = \mathrm{argmin}_{u \in Z}\{V(z, u) + \langle \xi, u\rangle\}:\ \mathcal{E} \to Z^o.$

We illustrate the just-defined notions with three basic examples. 

Example 1: Euclidean setup. Here $\mathcal{E}$ is $\mathbf{R}^N$ with the standard inner product, $\|\cdot\| = \|\cdot\|_2$ is the standard Euclidean norm on $\mathbf{R}^N$ (so that $\|\cdot\|_* = \|\cdot\|$), and $\omega(z) = \frac{1}{2}z^Tz$ (i.e., $Z^o = Z$, $\alpha = 1$). Assuming for the sake of simplicity that $0 \in Z$, we have $z_c = 0$, $\Omega = \max_{z \in Z}\|z\|_2$ and $\Theta = \frac{1}{2}\Omega^2$. The prox-function and the prox-mapping are given by $V(z, u) = \frac{1}{2}\|z - u\|_2^2$, $P(z, \xi) = \mathrm{argmin}_{u \in Z}\|(z - \xi) - u\|_2$.
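In the Euclidean setup the prox-mapping is the metric projection of $z - \xi$ onto $Z$. The following minimal sketch takes $Z$ to be a centered Euclidean ball, which is an assumption made here so that the projection has a closed form; the function name `prox_euclidean_ball` is likewise illustrative.

```python
import math


def prox_euclidean_ball(z, xi, radius=1.0):
    """P(z, xi) for omega(z) = z^T z / 2 and Z = {u : ||u||_2 <= radius}."""
    # The prox-mapping is the Euclidean projection of z - xi onto Z.
    shifted = [zj - xj for zj, xj in zip(z, xi)]
    norm = math.sqrt(sum(s * s for s in shifted))
    if norm <= radius:
        return shifted  # already feasible, nothing to project
    return [radius * s / norm for s in shifted]  # radial rescaling onto the sphere


inside = prox_euclidean_ball([0.1, 0.0], [0.05, 0.0])
clipped = prox_euclidean_ball([0.0, 0.0], [-3.0, -4.0])
```

For other sets $Z$ only the projection step changes; the rest of the setup stays the same.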

Example 2: Simplex setup. Here $\mathcal{E}$ is $\mathbf{R}^N$, $N > 1$, with the standard inner product, $\|z\| = \|z\|_1 := \sum_{j=1}^N |z_j|$ (so that $\|\xi\|_* = \max_j |\xi_j|$), $Z$ is a closed convex subset of the standard simplex

$\mathcal{D}_N = \{z \in \mathbf{R}^N:\ z \ge 0,\ \sum_{j=1}^N z_j = 1\}$

containing its barycenter, and $\omega(z) = \sum_{j=1}^N z_j \ln z_j$ is the entropy. Then

$Z^o = \{z \in Z:\ z > 0\}$ and $\omega'(z) = [1 + \ln z_1; \ldots; 1 + \ln z_N]$, $z \in Z^o$.

It is easily seen (see, e.g., [5]) that here

$\alpha = 1,\quad z_c = [1/N; \ldots; 1/N],\quad \Theta \le \ln(N)$

(the latter inequality becomes equality when $Z$ contains a vertex of $\mathcal{D}_N$), and thus $\Omega \le \sqrt{2\ln N}$. The prox-function is

$V(z, u) = \sum_{j=1}^N u_j \ln(u_j/z_j),$

and the prox-mapping is easy to compute when $Z = \mathcal{D}_N$:

$[P(z, \xi)]_j = \Big(\sum_{i=1}^N z_i \exp\{-\xi_i\}\Big)^{-1} z_j \exp\{-\xi_j\}.$
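When $Z$ is the whole standard simplex, minimizing $\omega(u) + \langle \xi - \omega'(z), u\rangle$ over the simplex gives the multiplicative update $u_j \propto z_j \exp\{-\xi_j\}$. The sketch below implements this update together with the entropy prox-function; the function names are illustrative, and the constant shift applied to $\xi$ is a numerical-stability choice made for this example.

```python
import math


def prox_entropy_simplex(z, xi):
    """P(z, xi) on the full simplex with the entropy distance-generating function."""
    shift = min(xi)  # subtracting a constant from xi does not change the result
    weights = [zj * math.exp(-(xj - shift)) for zj, xj in zip(z, xi)]
    total = sum(weights)
    return [w / total for w in weights]  # normalise back onto the simplex


def entropy_prox_function(z, u):
    """V(z, u) = sum_j u_j ln(u_j / z_j), with the convention 0 ln 0 = 0."""
    return sum(uj * math.log(uj / zj) for zj, uj in zip(z, u) if uj > 0.0)


step = prox_entropy_simplex([0.5, 0.5], [0.0, math.log(2.0)])
self_distance = entropy_prox_function([0.5, 0.5], [0.5, 0.5])
```

Here the second coordinate is penalised by $\ln 2$, so its weight is halved before normalisation and the result is $[2/3; 1/3]$.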
Example 3: Spectahedron setup. This is the "matrix analogue" of the Simplex setup. Specifically, now $\mathcal{E}$ is the space of $N \times N$ block-diagonal symmetric matrices, $N > 1$, of a given block-diagonal structure, equipped with the Frobenius inner product $\langle a, b\rangle_F = \mathrm{Tr}(ab)$ and the trace norm $|a|_1 = \sum_{j=1}^N |\lambda_j(a)|$, where $\lambda_1(a) \ge \ldots \ge \lambda_N(a)$ are the eigenvalues of a symmetric $N \times N$ matrix $a$; the conjugate norm $|a|_\infty$ is the usual spectral norm (the largest singular value) of $a$. $Z$ is assumed to be a closed convex subset of the spectahedron $S = \{z \in \mathcal{E}:\ z \succeq 0,\ \mathrm{Tr}(z) = 1\}$ containing the matrix $N^{-1}I_N$. The distance-generating function is the matrix entropy

$\omega(z) = \sum_{j=1}^N \lambda_j(z) \ln \lambda_j(z),$

so that $Z^o = \{z \in Z:\ z \succ 0\}$ and $\omega'(z) = \ln(z)$. This setup, similarly to the Simplex one, results in $\alpha = 1$, $z_c = N^{-1}I_N$, $\Theta \le \ln N$ and $\Omega \le \sqrt{2\ln N}$ [2]. When $Z = S$, it is relatively easy to compute the prox-mapping (see [2], [6]); this task reduces to the singular value decomposition of a matrix from $\mathcal{E}$. It should be added that the positive definite matrices from $S$ are exactly the matrices of the form

$a = H(b) \equiv (\mathrm{Tr}(\exp\{b\}))^{-1}\exp\{b\}$

with $b \in \mathcal{E}$. Note also that when $Z = S$, the prox-mapping becomes "linear in matrix logarithm": if $z = H(a)$, then $P(z, \xi) = H(a - \xi)$.
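The "linear in matrix logarithm" rule $P(z, \xi) = H(\ln z - \xi)$ can be sketched in a few lines. To stay within the standard library, the example below is restricted to symmetric $2 \times 2$ matrices, whose eigen-decomposition is available in closed form; this restriction and the helper names `sym2_apply` and `prox_spectahedron` are assumptions of the example, and a general implementation would rely on a numerical eigensolver instead.

```python
import math


def sym2_apply(mat, func):
    """Apply a scalar function to a symmetric 2x2 matrix via its eigen-decomposition."""
    (a, b), (_, c) = mat
    mean, half_diff = (a + c) / 2.0, (a - c) / 2.0
    radius = math.hypot(half_diff, b)
    if radius == 0.0:
        f = func(mean)  # multiple of the identity
        return [[f, 0.0], [0.0, f]]
    # (cos(theta), sin(theta)) is a unit eigenvector for the larger eigenvalue.
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    cs, sn = math.cos(theta), math.sin(theta)
    f1, f2 = func(mean + radius), func(mean - radius)
    off = (f1 - f2) * cs * sn
    return [[f1 * cs * cs + f2 * sn * sn, off],
            [off, f1 * sn * sn + f2 * cs * cs]]


def prox_spectahedron(z, xi):
    """P(z, xi) = H(ln z - xi) with H(b) = exp(b) / Tr(exp(b)), for z > 0, Tr z = 1."""
    log_z = sym2_apply(z, math.log)
    arg = [[log_z[i][j] - xi[i][j] for j in range(2)] for i in range(2)]
    e = sym2_apply(arg, math.exp)
    trace = e[0][0] + e[1][1]
    return [[e[i][j] / trace for j in range(2)] for i in range(2)]


diag_step = prox_spectahedron([[0.5, 0.0], [0.0, 0.5]],
                              [[0.0, 0.0], [0.0, math.log(2.0)]])
full_step = prox_spectahedron([[0.6, 0.2], [0.2, 0.4]],
                              [[0.3, -0.1], [-0.1, 0.5]])
```

For diagonal inputs the rule reduces to the Simplex update applied to the eigenvalues, which is what `diag_step` illustrates.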

3. Stochastic Mirror-Prox algorithm. 

3.1. Mirror-Prox algorithm with erroneous information. We are about to present the Mirror-Prox algorithm proposed in [5]. In contrast to the original version of the method, below we allow for errors when computing the values of $F$: we assume that given a point $z \in Z$, we can compute an approximation $\widehat{F}(z) \in \mathcal{E}$ of $F(z)$. The $t$-step Mirror-Prox algorithm as applied to (1.2) is as follows:

Algorithm 3.1.

1. Initialization: Choose $r_0 \in Z^o$ and stepsizes $\gamma_\tau > 0$, $1 \le \tau \le t$.

2. Step $\tau$, $\tau = 1, 2, \ldots, t$: Given $r_{\tau-1} \in Z^o$, set

(3.1) $w_\tau = P(r_{\tau-1}, \gamma_\tau \widehat{F}(r_{\tau-1})),\quad r_\tau = P(r_{\tau-1}, \gamma_\tau \widehat{F}(w_\tau)).$

When $\tau < t$, loop to step $\tau + 1$.

3. At step $t$, output

(3.2) $\widehat{z}_t = \Big[\sum_{\tau=1}^t \gamma_\tau\Big]^{-1} \sum_{\tau=1}^t \gamma_\tau w_\tau.$
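The recursion of Algorithm 3.1 can be sketched as follows in the Euclidean setup with a constant stepsize, so that the output is the plain average of the points $w_\tau$. Everything specific in this example is an assumption chosen for illustration: the domain is the unit disc in $\mathbf{R}^2$, the operator is $F(x, y) = (y, -x)$ (the monotone operator of the saddle function $\phi(x, y) = xy$, for which $L = 1$ and $M = 0$), and the oracle adds Gaussian noise. The names `make_oracle` and `stochastic_mirror_prox` are hypothetical.

```python
import math
import random


def project_ball(point, radius=1.0):
    """Euclidean prox-mapping: projection onto the centred ball of given radius."""
    norm = math.sqrt(sum(p * p for p in point))
    return list(point) if norm <= radius else [radius * p / norm for p in point]


def make_oracle(sigma, seed=0):
    """Noisy observations of F(x, y) = (y, -x); sigma = 0 gives the exact operator."""
    rng = random.Random(seed)

    def oracle(z):
        x, y = z
        return [y + sigma * rng.gauss(0.0, 1.0), -x + sigma * rng.gauss(0.0, 1.0)]

    return oracle


def stochastic_mirror_prox(oracle, r0, gamma, steps):
    """Constant-stepsize Mirror-Prox; returns the average of the points w_tau."""
    r = list(r0)
    average = [0.0] * len(r0)
    for _ in range(steps):
        g_r = oracle(r)  # first oracle call, at r_{tau-1}
        w = project_ball([ri - gamma * gi for ri, gi in zip(r, g_r)])
        g_w = oracle(w)  # second oracle call, at w_tau
        r = project_ball([ri - gamma * gi for ri, gi in zip(r, g_w)])
        average = [a + wi / steps for a, wi in zip(average, w)]
    return average


z_exact = stochastic_mirror_prox(make_oracle(0.0), [0.6, 0.0], gamma=0.5, steps=200)
z_noisy = stochastic_mirror_prox(make_oracle(0.5, seed=1), [0.6, 0.0], gamma=0.5, steps=200)
```

The stepsize 0.5 is below $1/\sqrt{3}$, which is what the restriction $\gamma_\tau \le \alpha/(\sqrt{3}L)$ requires when $\alpha = L = 1$. With exact information the averaged point approaches the unique solution $(0, 0)$; with noise the output is still a convex combination of feasible points and therefore stays in $Z$.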



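The preliminary technical result on the outlined algorithm is as follows.
Theorem 3.2. Consider the $t$-step Algorithm 3.1 as applied to a v.i. (1.2) with a monotone operator $F$ satisfying (1.4). For $\tau = 1, 2, \ldots$, let us set

$\Delta_\tau = F(w_\tau) - \widehat{F}(w_\tau);$

for $z$ belonging to the trajectory $\{r_0, w_1, r_1, \ldots, w_t, r_t\}$ of the algorithm, let

$\epsilon_z = \|\widehat{F}(z) - F(z)\|_*,$

and let $\{y_\tau \in Z^o\}_{\tau=0}^t$ be the sequence given by the recurrence

(3.3) $y_\tau = P(y_{\tau-1}, \gamma_\tau \Delta_\tau),\quad y_0 = r_0.$

Assume that

(3.4) $\gamma_\tau \le \frac{\alpha}{\sqrt{3}L}.$

Then

(3.5) $\mathrm{Err}_{\mathrm{vi}}(\widehat{z}_t) \le \Big(\sum_{\tau=1}^t \gamma_\tau\Big)^{-1}\Gamma(t),$

where $\mathrm{Err}_{\mathrm{vi}}(\widehat{z}_t)$ is defined in (1.3),

(3.6) $\Gamma(t) = 2\Theta(r_0) + \sum_{\tau=1}^t \frac{3\gamma_\tau^2}{2\alpha}\Big[M^2 + (\epsilon_{r_{\tau-1}} + \epsilon_{w_\tau})^2 + \frac{\epsilon_{w_\tau}^2}{3}\Big] + \sum_{\tau=1}^t \langle \gamma_\tau \Delta_\tau, w_\tau - y_{\tau-1}\rangle,$

and $\Theta(\cdot)$ is defined by (2.3).

Finally, when (1.2) is a Nash v.i., one can replace $\mathrm{Err}_{\mathrm{vi}}(\widehat{z}_t)$ in (3.5) with $\mathrm{Err}_{\mathrm{N}}(\widehat{z}_t)$.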



3.2. Main result. From now on, we focus on the case when Algorithm 3.1 solves monotone v.i. (1.2), and the corresponding monotone operator $F$ is represented by a stochastic oracle. Specifically, at the $i$-th call to the SO, the input being $z \in Z$, the oracle returns the vector $\widehat{F} = \Xi(z, \zeta_i)$, where $\{\zeta_i \in \mathbf{R}^N\}_{i=1}^\infty$ is a sequence of i.i.d. random variables, and $\Xi(z, \zeta): Z \times \mathbf{R}^N \to \mathcal{E}$ is a Borel function. We refer to this specific implementation of Algorithm 3.1 as the Stochastic Mirror Prox (SMP) algorithm.

In the sequel, we impose on the SO in question the following assumption, slightly milder than (1.6):

Assumption I: With some $\mu \in [0, \infty)$, for all $z \in Z$ we have

(3.7) (a) $\|\mathbf{E}\{\Xi(z, \zeta_i) - F(z)\}\|_* \le \mu$; (b) $\mathbf{E}\{\|\Xi(z, \zeta_i) - F(z)\|_*^2\} \le M^2.$

In some cases, we augment Assumption I by the following

Assumption II: For all $z \in Z$ and all $i$ we have

(3.8) $\mathbf{E}\{\exp\{\|\Xi(z, \zeta_i) - F(z)\|_*^2/M^2\}\} \le \exp\{1\}.$

Note that Assumption II implies (3.7.b), since

$\exp\{\mathbf{E}\{\|\Xi(z, \zeta_i) - F(z)\|_*^2/M^2\}\} \le \mathbf{E}\{\exp\{\|\Xi(z, \zeta_i) - F(z)\|_*^2/M^2\}\}$

by the Jensen inequality.

Remark 3.3. Observe that the accuracy of Algorithm 3.1 (cf. (3.6)) depends in the same way on the "size" of perturbation $\epsilon_z = \|\widehat{F}(z) - F(z)\|_*$ and the bound $M$ of (1.4) on the variation of the non-Lipschitz component of $F$. This is why, to simplify the presentation, we decided to use the same bound $M$ for the scale of the perturbation $\Xi(z, \zeta_i) - F(z)$ in Assumptions I and II.

Remark 3.4. From now on, we assume that the starting point $r_0$ in Algorithm 3.1 is the minimizer $z_c$ of $\omega(\cdot)$ on $Z$. Further, to avoid unnecessarily complicated formulas (and with no harm to the efficiency estimates) we stick to the constant stepsize policy $\gamma_\tau \equiv \gamma$, $1 \le \tau \le t$, where $t$ is a number of iterations of the algorithm fixed in advance. Our main result is as follows:

Theorem 3.5. Let v.i. (1.2) with monotone operator $F$ satisfying (1.4) be solved by the $t$-step Algorithm 3.1 using a SO, and let the stepsizes $\gamma_\tau \equiv \gamma$, $1 \le \tau \le t$, satisfy $0 < \gamma \le \frac{\alpha}{\sqrt{3}L}$, see (3.4). Then

(i) Under Assumption I, one has



(3.9) 



E{Err vi (z t )} ^ K Q (t) 



aSl 2 21ikP 7 
1 - 

try 



2a 



where $M$ is the constant from (1.4) and $\Omega$ is given by (2.3).
(ii) Under Assumptions I, II, one has, in addition to (3.9), for any $\Lambda > 0$,

(3.10) $\mathrm{Prob}\{\mathrm{Err}_{\mathrm{vi}}(\widehat{z}_t) > K_0(t) + \Lambda K_1(t)\} \le \exp\{-\Lambda^2/3\} + \exp\{-\Lambda t\},$

where

$K_1(t) \equiv \frac{7M^2\gamma}{2\alpha} + \frac{2M\Omega}{\sqrt{t}}.$

In the case of a Nash v.i., $\mathrm{Err}_{\mathrm{vi}}(\cdot)$ in (3.9), (3.10) can be replaced with $\mathrm{Err}_{\mathrm{N}}(\cdot)$.
When optimizing the bound (3.9) in $\gamma$, we get the following
Corollary 3.6. In the situation of Theorem 3.5, let the stepsizes $\gamma_\tau \equiv \gamma$ be chosen according to



(3.11) 



7 



a 



aSl 



7 a'L 
4 t 



7 om 



2^si, 



u V3i' M V 21t 

Then under Assumption I one has 

(3.12) E {Err vi (z 4 } sC K*{t) = max 

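(see (2.3)). Under Assumptions I, II, one has, in addition to (3.12), for any $\Lambda > 0$,

(3.13) $\mathrm{Prob}\{\mathrm{Err}_{\mathrm{vi}}(\widehat{z}_t) > K_0^*(t) + \Lambda K_1^*(t)\} \le \exp\{-\Lambda^2/3\} + \exp\{-\Lambda t\}$

with

$K_1^*(t) \equiv \frac{7\Omega M}{2\sqrt{t}}.$

In the case of a Nash v.i., $\mathrm{Err}_{\mathrm{vi}}(\cdot)$ in (3.12), (3.13) can be replaced with $\mathrm{Err}_{\mathrm{N}}(\cdot)$.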

3.3. Comparison with Robust Mirror SA Algorithm. Consider the case of a Nash s.v.i. with operator $F$ satisfying (1.4) with $L = 0$, and let the SO be unbiased (i.e., $\mu = 0$). In this case, the bound (3.12) reads



(3.14) 
where

$M^2 = \max\Big[\sup_{z, z' \in Z}\|F(z) - F(z')\|_*^2,\ \sup_{z \in Z}\mathbf{E}\{\|\Xi(z, \zeta_i) - F(z)\|_*^2\}\Big].$

The bound (3.14) looks very much like the efficiency estimate

(3.15) $\mathbf{E}\{\mathrm{Err}_{\mathrm{N}}(\tilde{z}_t)\} \le O(1)\frac{\Omega\overline{M}}{\sqrt{t}}$

(from now on, all $O(1)$'s are appropriate absolute positive constants) for the approximate solution $\tilde{z}_t$ of the $t$-step Robust Mirror SA (RMSA) algorithm [3] (see footnote 1). In the latter estimate, $\Omega$ is exactly the same as in (3.14), and $\overline{M}$ is given by

$\overline{M}^2 = \max\Big[\sup_{z \in Z}\|F(z)\|_*^2,\ \sup_{z \in Z}\mathbf{E}\{\|\Xi(z, \zeta_i) - F(z)\|_*^2\}\Big].$

Note that we always have $M \le 2\overline{M}$, and typically $M$ and $\overline{M}$ are of the same order of magnitude; it may happen, however (think of the case when $F$ is "almost constant"), that $M \ll \overline{M}$. Thus, the bound (3.14) is never worse, and sometimes can be much better, than the SA bound (3.15). It should be added that as far as implementation is concerned, the SMP algorithm is not more complicated than the RMSA (cf. the description of Algorithm 3.1 with the description

$r_t = P(r_{t-1}, \gamma_t \widehat{F}(r_{t-1}))$

of the RMSA).

The just outlined advantage of SMP as compared to the usual Stochastic Approximation is not that important, since "typically" $M$ and $\overline{M}$ are of the same order. We believe that the most interesting feature of the SMP algorithm is its ability to take advantage of a specific structure of a stochastic optimization problem, namely, its insensitivity to the presence in the objective of large, but smooth and well-observable components.

We are about to consider several less straightforward applications of the outlined insensitivity of the SMP algorithm to smooth well-observed components in the objective.

Footnote 1 (to the reference [3] in Section 3.3): In this reference, only the Minimization and the Saddle Point problems are considered. However, the results of [3] can be easily extended to s.v.i.'s.

4. Application to Stochastic Approximation: Stochastic composite minimization.

4.1. Problem description. Consider the optimization problem as follows (cf. [?]):

(4.1) $\min_{x \in X} \phi(x) := \Phi(\phi_1(x), \ldots, \phi_m(x)),$

where

1. $X \subset \mathcal{X}$ is a convex compact; the embedding space $\mathcal{X}$ is equipped with a norm $\|\cdot\|_x$, and $X$ with a distance-generating function $\omega_x(x)$ with certain parameters $\alpha_x, \Theta_x, \Omega_x$ w.r.t. the norm $\|\cdot\|_x$;

2. $\phi_\ell(x): X \to \mathcal{E}_\ell$, $1 \le \ell \le m$, are Lipschitz continuous mappings taking values in Euclidean spaces $\mathcal{E}_\ell$ equipped with norms (not necessarily the Euclidean ones) $\|\cdot\|_{(\ell)}$ with conjugates $\|\cdot\|_{(\ell,*)}$ and with closed convex cones $K_\ell$. We suppose that $\phi_\ell$ are $K_\ell$-convex, i.e. for any $x, x' \in X$, $\lambda \in [0, 1]$,

$\phi_\ell(\lambda x + (1 - \lambda)x') \le_{K_\ell} \lambda\phi_\ell(x) + (1 - \lambda)\phi_\ell(x'),$

where the notation $a \le_K b \Leftrightarrow b \ge_K a$ means that $b - a \in K$.
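In addition to these structural restrictions, we assume that for all $v, v' \in X$, $h \in \mathcal{X}$,

(4.2) (a) $\|[\phi'_\ell(v) - \phi'_\ell(v')]h\|_{(\ell)} \le [L_x\|v - v'\|_x + M_x]\|h\|_x$; (b) $\|[\phi'_\ell(v)]h\|_{(\ell)} \le [L_x\Omega_x + M_x]\|h\|_x$

for certain selections $\phi'_\ell(x) \in \partial^{K_\ell}\phi_\ell(x)$, $x \in X$ (see footnote 2), and certain nonnegative constants $L_x$ and $M_x$.

3. Functions $\phi_\ell(\cdot)$ are represented by an unbiased SO. At the $i$-th call to the oracle, $x \in X$ being the input, the oracle returns vectors $f_\ell(x, \zeta_i) \in \mathcal{E}_\ell$ and linear mappings $G_\ell(x, \zeta_i)$ from $\mathcal{X}$ to $\mathcal{E}_\ell$, $1 \le \ell \le m$ ($\{\zeta_i\}$ are i.i.d. random vectors) such that for any $x \in X$ and $i = 1, 2, \ldots$,

(4.3) (a) $\mathbf{E}\{f_\ell(x, \zeta_i)\} = \phi_\ell(x)$, $1 \le \ell \le m$;
(b) $\mathbf{E}\{\max_{1 \le \ell \le m}\|f_\ell(x, \zeta_i) - \phi_\ell(x)\|_{(\ell)}^2\} \le M_x^2\Omega_x^2$;
(c) $\mathbf{E}\{G_\ell(x, \zeta_i)\} = \phi'_\ell(x)$, $1 \le \ell \le m$;
(d) $\mathbf{E}\{\max_{h \in \mathcal{X},\, \|h\|_x \le 1}\|[G_\ell(x, \zeta_i) - \phi'_\ell(x)]h\|_{(\ell)}^2\} \le M_x^2$, $1 \le \ell \le m$.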
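4. $\Phi(\cdot)$ is a convex function on $\mathcal{E} = \mathcal{E}_1 \times \cdots \times \mathcal{E}_m$ given by the representation

(4.4) $\Phi(u_1, \ldots, u_m) = \max_{y \in Y}\Big\{\sum_{\ell=1}^m \langle u_\ell, A_\ell y + b_\ell\rangle_{\mathcal{E}_\ell} - \Phi_*(y)\Big\}$

for $u_\ell \in \mathcal{E}_\ell$, $1 \le \ell \le m$. Here

(a) $Y \subset \mathcal{Y}$ is a convex compact set containing the origin; the embedding Euclidean space $\mathcal{Y}$ is equipped with a norm $\|\cdot\|_y$, and $Y$ with a distance-generating function $\omega_y(y)$ with parameters $\alpha_y, \Theta_y, \Omega_y$ w.r.t. the norm $\|\cdot\|_y$;

(b) the affine mappings $y \mapsto A_\ell y + b_\ell: \mathcal{Y} \to \mathcal{E}_\ell$ are such that $A_\ell y + b_\ell \in K_\ell^*$ for all $y \in Y$ and all $\ell$; here $K_\ell^*$ is the cone dual to $K_\ell$;

(c) $\Phi_*(y)$ is a given convex function on $Y$ such that

(4.5) $\|\Phi'_*(y) - \Phi'_*(y')\|_{y,*} \le L_y\|y - y'\|_y + M_y$

for certain selection $\Phi'_*(y) \in \partial\Phi_*(y)$, $y \in Y$.

Footnote 2: For a $K$-convex function $\phi: X \to \mathcal{E}$ ($X \subset \mathcal{X}$ is convex, $K \subset \mathcal{E}$ is a closed convex cone) and $x \in X$, the $K$-subdifferential $\partial^K\phi(x)$ is comprised of all linear mappings $h \mapsto Ph: \mathcal{X} \to \mathcal{E}$ such that $\phi(u) \ge_K \phi(x) + P(u - x)$ for all $u \in X$. When $\phi$ is Lipschitz continuous on $X$, $\partial^K\phi(x) \ne \emptyset$ for all $x \in X$; if $\phi$ is differentiable at $x \in \mathrm{int}\,X$ (as is the case almost everywhere on $\mathrm{int}\,X$), one has $\frac{\partial\phi(x)}{\partial x} \in \partial^K\phi(x)$.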
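Example: Stochastic Matrix Minimax problem (SMMP). For $1 \le \ell \le m$, let $\mathcal{E}_\ell = \mathbf{S}^{p_\ell}$ be the space of symmetric $p_\ell \times p_\ell$ matrices equipped with the Frobenius inner product $\langle A, B\rangle_F = \mathrm{Tr}(AB)$ and the spectral norm $|\cdot|_\infty$, and let $K_\ell$ be the cone $\mathbf{S}^{p_\ell}_+$ of symmetric positive semidefinite $p_\ell \times p_\ell$ matrices. Consider the problem

(P) $\min_{x \in X}\max_{1 \le j \le k}\lambda_{\max}\Big(\sum_{\ell=1}^m P_{j\ell}^T\phi_\ell(x)P_{j\ell}\Big),$

where $P_{j\ell}$ are given $p_\ell \times q_j$ matrices, and $\lambda_{\max}(A)$ is the maximal eigenvalue of a symmetric matrix $A$. Observe that for a symmetric $q \times q$ matrix $A$ one has

$\lambda_{\max}(A) = \max_{S \in \mathcal{S}_q}\mathrm{Tr}(AS),$

where $\mathcal{S}_q = \{S \in \mathbf{S}^q_+:\ \mathrm{Tr}(S) = 1\}$. Denoting by $Y$ the set of all symmetric positive semidefinite block-diagonal matrices $y = \mathrm{Diag}\{y_1, \ldots, y_k\}$ with unit trace and diagonal blocks $y_j$ of sizes $q_j \times q_j$, we can represent (P) in the form of (4.1), (4.4) with

$\Phi(u) := \max_{1 \le j \le k}\lambda_{\max}\Big(\sum_{\ell=1}^m P_{j\ell}^Tu_\ell P_{j\ell}\Big) = \max_{y = \mathrm{Diag}\{y_1, \ldots, y_k\} \in Y}\sum_{j=1}^k\mathrm{Tr}\Big(\Big[\sum_{\ell=1}^m P_{j\ell}^Tu_\ell P_{j\ell}\Big]y_j\Big)$
$\qquad = \max_{y = \mathrm{Diag}\{y_1, \ldots, y_k\} \in Y}\sum_{\ell=1}^m\mathrm{Tr}\Big(u_\ell\Big[\sum_{j=1}^k P_{j\ell}y_jP_{j\ell}^T\Big]\Big) = \max_{y = \mathrm{Diag}\{y_1, \ldots, y_k\} \in Y}\sum_{\ell=1}^m\langle u_\ell, A_\ell y\rangle_F$

(we put $A_\ell y = \sum_{j=1}^k P_{j\ell}y_jP_{j\ell}^T$). The set $Y$ is the spectahedron in the space $\mathbf{S}^q$ of symmetric block-diagonal matrices with $k$ diagonal blocks of the sizes $q_j \times q_j$, $1 \le j \le k$. When equipping $Y$ with the spectahedron setup, we get $\alpha_y = 1$, $\Theta_y = \ln(\sum_{j=1}^k q_j)$ and $\Omega_y = \sqrt{2\ln(\sum_{j=1}^k q_j)}$, see Section 2.2.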
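The variational description of the maximal eigenvalue used in this example, $\lambda_{\max}(A) = \max\{\mathrm{Tr}(AS):\ S \succeq 0,\ \mathrm{Tr}(S) = 1\}$, is easy to check numerically. The sketch below does so for symmetric $2 \times 2$ matrices, where eigenvalues and eigenvectors have closed forms; the restriction to size 2, the random sampling scheme, and all function names are assumptions made for this illustration.

```python
import math
import random


def lambda_max_2x2(A):
    """Largest eigenvalue of a symmetric 2x2 matrix."""
    (a, b), (_, c) = A
    return (a + c) / 2.0 + math.hypot((a - c) / 2.0, b)


def top_eigenprojector_2x2(A):
    """S = v v^T for a unit eigenvector v of the largest eigenvalue."""
    (a, b), (_, c) = A
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    cs, sn = math.cos(theta), math.sin(theta)
    return [[cs * cs, cs * sn], [cs * sn, sn * sn]]


def trace_product(A, S):
    """Tr(A S) for 2x2 matrices."""
    return sum(A[i][j] * S[j][i] for i in range(2) for j in range(2))


def random_spectahedron_point(rng):
    """G G^T / Tr(G G^T) is positive semidefinite with unit trace."""
    g = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(2)]
    s = [[sum(g[i][k] * g[j][k] for k in range(2)) for j in range(2)] for i in range(2)]
    trace = s[0][0] + s[1][1]
    return [[s[i][j] / trace for j in range(2)] for i in range(2)]


A = [[2.0, 1.0], [1.0, 0.0]]
lam = lambda_max_2x2(A)
attained = trace_product(A, top_eigenprojector_2x2(A))
rng = random.Random(0)
sampled = [trace_product(A, random_spectahedron_point(rng)) for _ in range(1000)]
```

The rank-one projector onto the leading eigenvector attains the maximum, while every sampled point of the spectahedron gives a value that does not exceed it.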

Observe that in the simplest case of $k = m$, $p_j = q_j$, $1 \le j \le m$, and $P_{j\ell}$ equal to $I_{p_j}$ for $j = \ell$ and to $0$ otherwise, the SMMP problem becomes

(4.6) $\min_{x \in X}\max_{1 \le \ell \le m}\lambda_{\max}(\phi_\ell(x)).$

If, in addition, $p_j = q_j = 1$ for all $j$, we arrive at the usual ("scalar") minimax problem

(4.7) $\min_{x \in X}\max_{1 \le \ell \le m}\phi_\ell(x)$

with convex real-valued functions $\phi_\ell$.
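Observe that in the case of (4.4), the optimization problem (4.1) is nothing but the primal problem associated with the saddle point problem

(4.8) $\min_{x \in X}\max_{y \in Y}\Big[\phi(x, y) = \sum_{\ell=1}^m\langle\phi_\ell(x), A_\ell y + b_\ell\rangle_{\mathcal{E}_\ell} - \Phi_*(y)\Big],$

and the cost function in the latter problem is Lipschitz continuous and convex-concave due to the $K_\ell$-convexity of $\phi_\ell(\cdot)$ and the condition $A_\ell y + b_\ell \in K_\ell^*$ whenever $y \in Y$. The associated Nash v.i. is given by the domain $Z = X \times Y$ and the monotone mapping

(4.9) $F(z) \equiv F(x, y) = \Big[\sum_{\ell=1}^m[\phi'_\ell(x)]^*[A_\ell y + b_\ell];\ -\sum_{\ell=1}^m A_\ell^*\phi_\ell(x) + \Phi'_*(y)\Big].$

The advantage of the v.i. reformulation of (4.1) is that $F$ is linear in $\phi_\ell(\cdot)$, so that the initial unbiased SO for $\phi_\ell$ induces an unbiased stochastic oracle for $F$, specifically, the oracle

(4.10) $\Xi(x, y, \zeta_i) = \Big[\sum_{\ell=1}^m G_\ell^*(x, \zeta_i)[A_\ell y + b_\ell];\ -\sum_{\ell=1}^m A_\ell^*f_\ell(x, \zeta_i) + \Phi'_*(y)\Big].$

We are about to use this oracle in order to solve the stochastic composite minimization problem (4.1) by the SMP algorithm.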



4.2. Setup for the SMP as applied to (4.9). In retrospect, the setup for SMP we are about to present is a kind of the best (resulting in the best possible efficiency estimate (3.12)) we can build from the entities participating in the description of the problem (4.1). Specifically, we equip the space $\mathcal{E} = \mathcal{X} \times \mathcal{Y}$ with the norm

$\|(x, y)\| \equiv \sqrt{\|x\|_x^2/\Omega_x^2 + \|y\|_y^2/\Omega_y^2};$

the conjugate norm clearly is

$\|(\xi, \eta)\|_* = \sqrt{\Omega_x^2\|\xi\|_{x,*}^2 + \Omega_y^2\|\eta\|_{y,*}^2}.$

Finally, we equip $Z = X \times Y$ with the distance-generating function

$\omega(x, y) = \frac{1}{\alpha_x\Omega_x^2}\omega_x(x) + \frac{1}{\alpha_y\Omega_y^2}\omega_y(y).$

The SMP-related properties of our setup are summarized in the following
Lemma 4.1. Let 



(4.11) 



(i) The parameters of the just defined distance-generating function $\omega$ w.r.t. the just defined norm $\|\cdot\|$ are $\alpha = 1$, $\Theta = 1$, $\Omega = \sqrt{2}$.

(ii) One has

(4.12) $\forall (z, z' \in Z):\ \|F(z) - F(z')\|_* \le L\|z - z'\| + M,$

where

$L = 5A\Omega_x\Omega_y[\Omega_xL_x + M_x] + B\Omega_x^2L_x + \Omega_y^2L_y,$
$M = [2A\Omega_y + \|b\|_1]\Omega_xM_x + \Omega_yM_y.$



Besides this, 




then 



(4.15) $\mathbf{E}\{\exp\{\|\Xi(z, \zeta_i) - F(z)\|_*^2/M^2\}\} \le \exp\{1\}.$

Combining Lemma 4.1 with Corollary 3.6 we get explicit efficiency estimates for the SMP algorithm as applied to the Stochastic composite minimization problem (4.1).

4.3. Application to Stochastic Semidefinite Feasibility problem. Assume we are interested in solving a feasible system of matrix inequalities

(4.16) $\psi_\ell(x) \preceq 0,\ \ell = 1, \ldots, m\ \ \&\ \ x \in X,$

where $m \ge 1$, $X \subset \mathcal{X}$ is as in the description of the Stochastic composite problem, and $\psi_\ell(\cdot)$ take values in the spaces $\mathcal{E}_\ell = \mathbf{S}^{p_\ell}$ of symmetric $p_\ell \times p_\ell$ matrices. We equip $\mathcal{E}_\ell$ with the Frobenius inner product, the semidefinite cone $K_\ell = \mathbf{S}^{p_\ell}_+$ and the spectral norm $\|\cdot\|_{(\ell)} = |\cdot|_\infty$ (recall that $|A|_\infty$ is the maximal singular value of the matrix $A$). We assume that $\psi_\ell$ are Lipschitz continuous and $K_\ell = \mathbf{S}^{p_\ell}_+$-convex functions on $X$ such that for all $x, x' \in X$ and for all $\ell$ one has

(4.17) $\max_{h \in \mathcal{X},\, \|h\|_x \le 1}|[\psi'_\ell(x) - \psi'_\ell(x')]h|_\infty \le L_\ell\|x - x'\|_x + M_\ell,$
$\qquad\ \ \max_{h \in \mathcal{X},\, \|h\|_x \le 1}|\psi'_\ell(x)h|_\infty \le L_\ell\Omega_x + M_\ell$

for certain selections $\psi'_\ell(x) \in \partial^{K_\ell}\psi_\ell(x)$, $x \in X$, with some known nonnegative constants $L_\ell$, $M_\ell$.

We assume that $\psi_\ell(\cdot)$ are represented by an SO which at the $i$-th call, the input being $x \in X$, returns the matrices $\widehat{f}_\ell(x, \zeta_i) \in \mathbf{S}^{p_\ell}$ and the linear maps $\widehat{G}_\ell(x, \zeta_i)$ from $\mathcal{X}$ to $\mathcal{E}_\ell$ such that for all $x \in X$ it holds

(4.18)
Given a number $t$ of steps of the SMP algorithm, let us act as follows.

A. We compute the $m$ quantities $\mu_\ell = L_\ell + M_\ell$, $\ell = 1, \ldots, m$, and set

(4.19) $\mu = \max_{1 \le \ell \le m}\mu_\ell,\quad \beta_\ell = \frac{\mu}{\mu_\ell},\quad \phi_\ell(\cdot) = \beta_\ell\psi_\ell(\cdot),\quad L_x = M_x = \mu.$

Note that by construction $\beta_\ell \ge 1$ and $L_x/L_\ell \ge \beta_\ell$, $M_x/M_\ell \ge \beta_\ell$ for all $\ell$, so that the functions $\phi_\ell$ satisfy (4.2) with the just defined $L_x$, $M_x$. Further, the SO for the $\psi_\ell(\cdot)$'s can be converted into an SO for the $\phi_\ell(\cdot)$'s by setting

$f_\ell(x, \zeta) = \beta_\ell\widehat{f}_\ell(x, \zeta),\quad G_\ell(x, \zeta) = \beta_\ell\widehat{G}_\ell(x, \zeta).$

By (4.18), this oracle satisfies (4.3).

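B. We then build the Stochastic Matrix Minimax problem

(4.20) $\min_{x \in X}\max_{1 \le \ell \le m}\lambda_{\max}(\phi_\ell(x)),$

associated with the just defined $\phi_1, \ldots, \phi_m$, that is, the Stochastic composite problem (4.1) associated with $\phi_1, \ldots, \phi_m$ and the outer function

$\Phi(u_1, \ldots, u_m) = \max_{1 \le \ell \le m}\lambda_{\max}(u_\ell) = \max_{y \in Y}\sum_{\ell=1}^m\langle u_\ell, y_\ell\rangle_F,$

$Y = \{y = \mathrm{Diag}\{y_1, \ldots, y_m\}:\ y_\ell \in \mathbf{S}^{p_\ell},\ 1 \le \ell \le m,\ y \succeq 0,\ \mathrm{Tr}(y) = 1\} \subset \mathcal{Y} = \mathbf{S}^{p_1} \times \cdots \times \mathbf{S}^{p_m}.$

Thus in the notation from (4.4) we have $A_\ell y = y_\ell$, $b_\ell = 0$, $\Phi_* \equiv 0$. Hence $L_y = M_y = 0$, and $Y$ is a spectahedron. We equip $\mathcal{Y}$ and $Y$ with the Spectahedron setup, arriving at

$\alpha_y = 1,\quad \Theta_y = \ln\Big(\sum_{\ell=1}^m p_\ell\Big),\quad \Omega_y = \sqrt{2\ln\Big(\sum_{\ell=1}^m p_\ell\Big)}.$

C. We have specified all entities participating in the description of the Stochastic composite problem. It is immediately seen that these entities satisfy all conditions of Section 4.1. We can now solve the resulting Stochastic composite problem by the $t$-step SMP algorithm with the setup presented in Section 4.2. The corresponding convex-concave saddle point problem is

$\min_{x \in X}\max_{y \in Y}\sum_{\ell=1}^m\beta_\ell\langle\psi_\ell(x), y_\ell\rangle_F;$

with the monotone operator and SO, respectively,

$F(z) \equiv F(x, y) = \Big[\sum_{\ell=1}^m\beta_\ell[\psi'_\ell(x)]^*y_\ell;\ -\mathrm{Diag}\{\beta_1\psi_1(x), \ldots, \beta_m\psi_m(x)\}\Big],$

$\Xi((x, y), \zeta) = \Big[\sum_{\ell=1}^m\beta_\ell\widehat{G}_\ell^*(x, \zeta)y_\ell;\ -\mathrm{Diag}\{\beta_1\widehat{f}_1(x, \zeta), \ldots, \beta_m\widehat{f}_m(x, \zeta)\}\Big].$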

Combining Lemma 4.1, Corollary 3.6 and taking into account the origin of the quantities $L_x$, $M_x$, and that $A = 1$, $B = 0$ (see footnote 3), we arrive at the following result:

Footnote 3: See (4.11) and note that we are in the case when $b_\ell = 0$ and $\|\cdot\|_{(\ell,*)}$ is the trace norm; thus, $\sum_{\ell=1}^m\|A_\ell y\|_{(\ell,*)} = \sum_{\ell=1}^m|y_\ell|_1 = |y|_1 = \|y\|_y$.

Proposition 4.2. With the outlined construction, the resulting s.v.i. reads

(4.21) find $z_* \in Z = X \times Y$: $\langle F(z), z - z_*\rangle \ge 0\ \ \forall z \in Z,$

for the monotone operator $F$ which satisfies (1.4) with



L = 10 



^=1 



o^Vt + i), m = 4 



6=1 



Beside this, the resulting SO for F satisfies (|4. 13|) im'i/i i/ie just defined value of M. 
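Let now

$\gamma = \left[10\left[3\ln\left(\sum_{\ell=1}^m p_\ell\right)\right]^{1/2}\Omega_x\mu(\sqrt{t}+1)\right]^{-1}.$

When applying to (4.21) the $t$-step SMP algorithm with the constant stepsizes $\gamma_\tau = \gamma$, $1\le\tau\le t$ (cf. (3.11) and note that we are in the situation $\alpha = \Theta = 1$), we get an approximate solution $z_t = (x_t, y_t)$ such that

(4.22)    $\mathbf{E}\left\{\max_{1\le\ell\le m}\beta_\ell\lambda_{\max}(\psi_\ell(x_t))\right\}\le 80\,\Omega_x\left[\ln\left(\sum_{\ell=1}^m p_\ell\right)\right]^{1/2}\frac{\mu}{\sqrt{t}}$

(cf. (3.12) and take into account that we are in the case of $\Omega = \sqrt{2}$, while the optimal value in (4.20) is nonpositive, since (4.16) is feasible).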

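Furthermore, if assumptions (4.18.b,c) are strengthened to

$\mathbf{E}\left\{\max_{1\le\ell\le m}\exp\{|\widehat{f}_\ell(x,\zeta_i) - \psi_\ell(x)|_\infty^2/(\Omega_xM_\ell)^2\}\right\}\le\exp\{1\},$

$\mathbf{E}\left\{\exp\{\max_{h\in\mathcal{X},\ \|h\|_x\le 1}|[\widehat{G}_\ell(x,\zeta_i) - \psi'_\ell(x)]h|_\infty^2/M_\ell^2\}\right\}\le\exp\{1\},\quad 1\le\ell\le m,$

then, in addition to (4.22), we have for any $\Lambda > 0$: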



Prob< max /JMmaxW'AZt)) > 80 j= h A — 

I yjt Vt 

s$ exp{-A 2 /3} + exp{-Ai}. 



Discussion. Imagine that instead of solving the system of matrix inequalities (4.16), we were interested to solve just a single matrix inequality $\psi_\ell(x)\preceq 0$, $x\in X$. When solving this inequality by the SMP algorithm as explained above, the efficiency estimate would be

$\mathbf{E}\{\lambda_{\max}(\psi_\ell(x_t^\ell))\}\le O(1)[\ln(p_\ell+1)]^{1/2}\,\Omega_x\left[\frac{\Omega_xL_\ell}{t} + \frac{M_\ell}{\sqrt{t}}\right] = O(1)[\ln(p_\ell+1)]^{1/2}\,\beta_\ell^{-1}\frac{\Omega_x\mu}{\sqrt{t}}$

(recall that the matrix inequality in question is feasible), where $x_t^\ell$ is the resulting approximate solution. Looking at (4.22), we see that the expected accuracy of the SMP as applied, in the aforementioned manner, to (4.16) is only by a logarithmic in $\sum_i p_i$ factor worse:

(4.23)    $\mathbf{E}\{\lambda_{\max}(\psi_\ell(x_t))\}\le O(1)\left[\ln\left(\sum_{i=1}^m p_i\right)\right]^{1/2}\beta_\ell^{-1}\frac{\Omega_x\mu}{\sqrt{t}} = O(1)\left[\ln\left(\sum_{i=1}^m p_i\right)\right]^{1/2}\Omega_x\left[\frac{\Omega_xL_\ell}{t} + \frac{M_\ell}{\sqrt{t}}\right].$

Thus, as far as the quality of the SMP-generated solution is concerned, passing from solving a single matrix inequality to solving a system of $m$ inequalities is "nearly costless". As an illustration, consider the case where some of $\psi_\ell$ are "easy" - smooth and easy-to-observe ($M_\ell = 0$), while the remaining $\psi_\ell$ are "difficult", i.e., might be non-smooth and/or difficult-to-observe ($L_\ell = 0$). In this case, (4.23) reads



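$\mathbf{E}\{\lambda_{\max}(\psi_\ell(x_t))\}\le\begin{cases} O(1)\left[\ln\left(\sum_{i=1}^m p_i\right)\right]^{1/2}\dfrac{\Omega_x^2L_\ell}{t}, & \psi_\ell\text{ is easy},\\[1ex] O(1)\left[\ln\left(\sum_{i=1}^m p_i\right)\right]^{1/2}\dfrac{\Omega_xM_\ell}{\sqrt{t}}, & \psi_\ell\text{ is difficult}.\end{cases}$

In other words, the violations of the easy and the difficult constraints in (4.16) converge to 0 as $t\to\infty$ with the rates $O(1/t)$ and $O(1/\sqrt{t})$, respectively. It should be added that when $X$ is the unit Euclidean ball in $\mathcal{X} = \mathbf{R}^n$ and $X$, $\mathcal{X}$ are equipped with the Euclidean setup, the rates of convergence $O(1/t)$ and $O(1/\sqrt{t})$ are the best rates one can achieve without imposing bounds on $n$ and/or imposing additional restrictions on $\psi_\ell$'s.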

4.4. Eigenvalue optimization via SMP. The problem we are interested in 
now is 



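(4.24)    $\mathrm{Opt} = \min_{x\in X} f(x),\quad f(x) := \lambda_{\max}(A_0 + x_1A_1 + \dots + x_nA_n),\quad X = \{x\in\mathbf{R}^n : x\ge 0,\ \sum_{j=1}^n x_j = 1\},$

where $A_0, A_1,\dots,A_n$, $n > 1$, are given symmetric matrices with common block-diagonal structure $(p_1,\dots,p_m)$. I.e., all $A_j$ are block-diagonal with diagonal blocks $A_j^\ell$ of sizes $p_\ell\times p_\ell$, $1\le\ell\le m$. We denote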



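$p^{(\kappa)} = \sum_{\ell=1}^m p_\ell^\kappa,\ \kappa = 1,2,3;\qquad p^{\max} = \max_{1\le\ell\le m}p_\ell.$

Setting

$\phi_\ell : \mathbf{R}^n\to E_\ell = \mathbf{S}^{p_\ell},\quad \phi_\ell(x) = A_0^\ell + \sum_{j=1}^n x_jA_j^\ell,\quad 1\le\ell\le m,$

we represent (4.24) as a particular case of the Matrix Minimax problem (4.6), with all functions $\phi_\ell(x)$ being affine and $X$ being the standard simplex in $\mathcal{X} = \mathbf{R}^n$.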

Now, since $A_j$ are known in advance, there is nothing stochastic in our problem, and it can be solved either by interior point methods, or by "computationally cheap" gradient-type methods; these latter methods are preferable when the problem is large-scale and medium accuracy solutions are sought. For instance, one can apply the $t$-step (deterministic) Mirror Prox algorithm from [5] to the saddle point reformulation of our specific Matrix Minimax problem, i.e., to the saddle point problem

(4.25)    $\min_{x\in X}\max_{y\in Y}\left\langle y, A_0 + \sum_{j=1}^n x_jA_j\right\rangle_F,\qquad Y = \{y = \mathrm{Diag}\{y_1,\dots,y_m\} : y_\ell\in\mathbf{S}^{p_\ell}_+,\ 1\le\ell\le m,\ \mathrm{Tr}(y) = 1\}.$

The accuracy of the approximate solution $x_t$ of the (deterministic) Mirror Prox algorithm is [5, Example 2]



/(^)-o P t<o(i)^W^ 



This efficiency estimate is the best known so far among those attainable with "computationally cheap" deterministic methods. On the other hand, the complexity of one step of the algorithm is dominated, up to an absolute constant factor, by the necessity, given $x\in X$ and $y\in Y$,

1. to compute the matrix $A_0 + \sum_{j=1}^n x_jA_j$ and the vector $[\mathrm{Tr}(yA_1);\dots;\mathrm{Tr}(yA_n)]$;

2. to compute the eigenvalue decomposition of $y$.

When using the standard Linear Algebra, the computational effort per step is

$C_{\mathrm{det}} = O(1)[np^{(2)} + p^{(3)}]$

arithmetic operations.
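To get a feeling for this operation count, the following minimal sketch evaluates the block-size sums $p^{(\kappa)}$ and the quantity $np^{(2)} + p^{(3)}$ (the absolute constant factor is dropped). The function names and the sample block structure are hypothetical.

```python
def block_power_sum(p, kappa):
    """p^(kappa): sum over the diagonal blocks of p_l ** kappa."""
    return sum(pl ** kappa for pl in p)

def deterministic_step_cost(n, p):
    """n * p^(2) + p^(3): per-step operation count, up to the O(1) factor."""
    return n * block_power_sum(p, 2) + block_power_sum(p, 3)

p = [2, 3, 5]   # hypothetical block sizes p_1, ..., p_m
n = 10          # hypothetical number of matrices A_1, ..., A_n
cost = deterministic_step_cost(n, p)
```

The first term counts the work of forming the linear combinations of the blocks, the second the dense linear algebra on each block.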

We are about to demonstrate that one can equip the deterministic problem in question by an "artificial" SO in such a way that the associated SMP algorithm, under certain circumstances, exhibits better performance than deterministic algorithms. Let us consider the following construction of the SO for $F$ (different from the SO (4.10)). Observe that the monotone operator associated with the saddle point problem (4.25) is

(4.26)    $F(x,y) = [F_x(x,y);\ F_y(x,y)],\qquad F_x(x,y) = [\mathrm{Tr}(yA_1);\dots;\mathrm{Tr}(yA_n)],\qquad F_y(x,y) = -A_0 - \sum_{j=1}^n x_jA_j.$

Given $x\in X$, $y = \mathrm{Diag}\{y_1,\dots,y_m\}\in Y$, we build a random estimate $\Xi = [\Xi_x;\Xi_y]$ of $F(x,y) = [F_x(x,y); F_y(x,y)]$ as follows:

1. we generate a realization $\jmath$ of a random variable taking values $1,\dots,n$ with probabilities $x_1,\dots,x_n$ (recall that $x\in X$, the standard simplex, so that $x$ indeed can be seen as a probability distribution), and set

(4.27)    $\Xi_y = -A_0 - A_\jmath;$

2. we compute the quantities $\nu_\ell = \mathrm{Tr}(y_\ell)$, $1\le\ell\le m$. Since $y\in Y$, we have $\nu_\ell\ge 0$ and $\sum_{\ell=1}^m\nu_\ell = 1$. We further generate a realization $\imath$ of a random variable taking values $1,\dots,m$ with probabilities $\nu_1,\dots,\nu_m$, and set

(4.28)    $\Xi_x = [\mathrm{Tr}(A_1^\imath\bar{y}_\imath);\dots;\mathrm{Tr}(A_n^\imath\bar{y}_\imath)],\qquad \bar{y}_\imath = (\mathrm{Tr}(y_\imath))^{-1}y_\imath.$

The just defined random estimate $\Xi$ of $F(x,y)$ can be expressed as a deterministic function $\Xi(x,y,\eta)$ of $(x,y)$ and a random variable $\eta$ uniformly distributed on $[0,1]$. Given $x$, $y$ and $\eta$, the value of this function can be computed with the arithmetic cost $O(1)(n(p^{\max})^2 + p^{(2)})$ (indeed, $O(1)(n + p^{(1)})$ operations are needed to convert $\eta$ into $\imath$ and $\jmath$, $O(1)p^{(2)}$ operations are used to write down the $y$-component $-A_0 - A_\jmath$ of $\Xi$, and $O(1)n(p^{\max})^2$ operations are needed to compute $\Xi_x$). Now consider the SO's $\Xi_k$ ($k$ is a positive integer) obtained by averaging the outputs of $k$ calls to our basic oracle $\Xi$. Specifically, at the $i$-th call to the oracle $\Xi_k$, $z = (x,y)\in Z = X\times Y$ being the input, the oracle returns the vector

$\Xi_k(z,\zeta_i) = \frac{1}{k}\sum_{s=1}^k\Xi(z,\eta_{is}),$

where $\zeta_i = [\eta_{i1};\dots;\eta_{ik}]$ and $\{\eta_{is}\}_{1\le i,\ 1\le s\le k}$ are independent random variables uniformly distributed on $[0,1]$. Note that the arithmetic cost of a single call to $\Xi_k$ is

$C_k = O(1)k(n(p^{\max})^2 + p^{(2)}).$
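The randomized oracle can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: block-diagonal matrices are stored as lists of dense blocks (`A[j][l]` is the $\ell$-th block of $A_j$), the two indices are drawn with two separate calls to `random.choices` instead of being decoded from a single uniform variable, and no attempt is made to reach the arithmetic cost quoted above. All function names are hypothetical.

```python
import random

def trace_prod(a, b):
    """Tr(a b) for two square matrices given as lists of rows."""
    k = len(a)
    return sum(a[r][c] * b[c][r] for r in range(k) for c in range(k))

def exact_operator(A, x, y):
    """F(x, y) = ([Tr(y A_j)]_j ; -A_0 - sum_j x_j A_j), stored block by block."""
    n, m = len(A) - 1, len(y)
    Fx = [sum(trace_prod(A[j][l], y[l]) for l in range(m)) for j in range(1, n + 1)]
    Fy = [[[-A[0][l][r][c] - sum(x[j - 1] * A[j][l][r][c] for j in range(1, n + 1))
            for c in range(len(y[l]))] for r in range(len(y[l]))] for l in range(m)]
    return Fx, Fy

def basic_oracle(A, x, y, rng):
    """One draw of the random estimate [Xi_x; Xi_y] of F(x, y)."""
    n, m = len(A) - 1, len(y)
    # y-component: pick j with probabilities x_1..x_n and return -A_0 - A_j.
    j = rng.choices(range(1, n + 1), weights=x)[0]
    Xi_y = [[[-A[0][l][r][c] - A[j][l][r][c] for c in range(len(y[l]))]
             for r in range(len(y[l]))] for l in range(m)]
    # x-component: pick a block with probabilities nu_l = Tr(y_l), normalize it.
    nu = [sum(y[l][r][r] for r in range(len(y[l]))) for l in range(m)]
    i = rng.choices(range(m), weights=nu)[0]
    y_bar = [[v / nu[i] for v in row] for row in y[i]]
    Xi_x = [trace_prod(A[jj][i], y_bar) for jj in range(1, n + 1)]
    return Xi_x, Xi_y

def averaged_oracle(A, x, y, k, rng):
    """Average of k independent calls to basic_oracle."""
    n, m = len(A) - 1, len(y)
    sum_x = [0.0] * n
    sum_y = [[[0.0] * len(y[l]) for _ in range(len(y[l]))] for l in range(m)]
    for _ in range(k):
        gx, gy = basic_oracle(A, x, y, rng)
        sum_x = [s + g for s, g in zip(sum_x, gx)]
        for l in range(m):
            for r in range(len(y[l])):
                for c in range(len(y[l])):
                    sum_y[l][r][c] += gy[l][r][c]
    return ([s / k for s in sum_x],
            [[[v / k for v in row] for row in blk] for blk in sum_y])

# Hypothetical instance: n = 2 matrices plus A_0, block structure (1, 2).
A = [
    [[[0.5]], [[1.0, 0.0], [0.0, -1.0]]],   # A_0
    [[[1.0]], [[0.0, 1.0], [1.0, 0.0]]],    # A_1
    [[[-1.0]], [[2.0, 0.0], [0.0, 0.5]]],   # A_2
]
x = [0.3, 0.7]
y = [[[0.4]], [[0.3, 0.1], [0.1, 0.3]]]     # y >= 0, Tr(y) = 1
Fx, Fy = exact_operator(A, x, y)
gx, gy = averaged_oracle(A, x, y, k=20000, rng=random.Random(1))
```

Averaging a large number of draws should reproduce $F(x,y)$ up to sampling error, which is the unbiasedness that the construction relies on.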

The Nash v.i. associated with the saddle point problem (4.25) and the stochastic oracle $\Xi_k$ ($k$ being the first parameter of our construction) specify a Nash s.v.i. on the domain $Z = X\times Y$. Let us equip the standard simplex $X$ and its embedding space $\mathcal{X} = \mathbf{R}^n$ with the Simplex setup, and the spectahedron $Y$ and its embedding space $\mathcal{Y} = \mathbf{S}^{p_1}\times\dots\times\mathbf{S}^{p_m}$ with the Spectahedron setup (see Section 2.2). Let us next combine the $x$- and the $y$-setups, exactly as explained in the beginning of Section 4.2, into an SMP setup for the domain $Z = X\times Y$ - a distance-generating function $\omega(\cdot)$ and a norm $\|\cdot\|$ on the embedding space $\mathbf{R}^n\times(\mathbf{S}^{p_1}\times\dots\times\mathbf{S}^{p_m})$ of $Z$. The SMP-related properties of the resulting setup are summarized in the following statement.
Lemma 4.3. Let $n\ge 3$, $p^{(1)}\ge 3$. Then

(i) The parameters of the just defined distance-generating function $\omega$ w.r.t. the just defined norm $\|\cdot\|$ are $\alpha = 1$, $\Theta = 1$, $\Omega = \sqrt{2}$.

(ii) For any $z, z'\in Z$ one has

(4.29)    $\|F(z) - F(z')\|_*\le L\|z - z'\|,\qquad L = [2\ln(n) + 4\ln(p^{(1)})]A_\infty.$

Besides this, for any $z\in Z$, $i = 1, 2, \dots$,

(4.30)    (a) $\mathbf{E}\{\Xi_k(z,\zeta_i)\} = F(z)$;
          (b) $\mathbf{E}\{\exp\{\|\Xi_k(z,\zeta_i) - F(z)\|_*^2/M^2\}\}\le\exp\{1\},\qquad M = 27[\ln(n) + \ln(p^{(1)})]A_\infty/\sqrt{k}.$

5. Appendix.

5.1. Proof of Theorem 3.2. We start with the following simple observation: if $r_e$ is a solution to (2.2), then $\partial_Z\omega(r_e)$ contains $-e$ and thus is nonempty, so that $r_e\in Z^o$. Moreover, one has

(5.1)    $\langle\omega'(r_e) + e, u - r_e\rangle\ge 0\quad\forall u\in Z.$

Indeed, by continuity argument, it suffices to verify the inequality in the case when $u\in\mathrm{rint}(Z)\subset Z^o$. For such an $u$, the convex function

$f(t) = \omega(r_e + t(u - r_e)) + \langle r_e + t(u - r_e), e\rangle,\quad t\in[0,1],$

is continuous on $[0,1]$ and has a continuous on $[0,1]$ field of subgradients

$g(t) = \langle\omega'(r_e + t(u - r_e)) + e, u - r_e\rangle.$

It follows that the function is continuously differentiable on $[0,1]$ with the derivative $g(t)$. Since the function attains its minimum on $[0,1]$ at $t = 0$, we have $g(0)\ge 0$, which is exactly (5.1).

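At least the first statement of the following Lemma is well-known:

Lemma 5.1. For every $z\in Z^o$, the mapping $\xi\mapsto P(z,\xi)$ is a single-valued mapping of $\mathcal{E}$ onto $Z^o$, and this mapping is Lipschitz continuous, specifically,

(5.2)    $\|P(z,\zeta) - P(z,\eta)\|\le\alpha^{-1}\|\zeta - \eta\|_*\quad\forall\zeta,\eta\in\mathcal{E}.$

Besides this,

(5.3)    (a) $\forall(u\in Z):\ V(P(z,\zeta),u)\le V(z,u) + \langle\zeta, u - P(z,\zeta)\rangle - V(z,P(z,\zeta))$
         (b) $\qquad\qquad\le V(z,u) + \langle\zeta, u - z\rangle + \frac{\|\zeta\|_*^2}{2\alpha}.$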

Proof.

Let $v\in P(z,\zeta)$, $w\in P(z,\eta)$. As $V'_u(z,u) = \omega'(u) - \omega'(z)$, invoking (5.1), we have $v, w\in Z^o$ and

(5.4)    $\langle\omega'(v) - \omega'(z) + \zeta, v - u\rangle\le 0\quad\forall u\in Z,$

(5.5)    $\langle\omega'(w) - \omega'(z) + \eta, w - u\rangle\le 0\quad\forall u\in Z.$

Setting $u = w$ in (5.4) and $u = v$ in (5.5), we get

$\langle\omega'(v) - \omega'(z) + \zeta, v - w\rangle\le 0,\qquad\langle\omega'(w) - \omega'(z) + \eta, v - w\rangle\ge 0,$

whence $\langle\omega'(w) - \omega'(v) + [\eta - \zeta], v - w\rangle\ge 0$, or

$\|\eta - \zeta\|_*\|v - w\|\ge\langle\eta - \zeta, v - w\rangle\ge\langle\omega'(v) - \omega'(w), v - w\rangle\ge\alpha\|v - w\|^2,$

and (5.2) follows. This relation, as a byproduct, implies that $P(z,\cdot)$ is single-valued.

To prove (5.3), let $v = P(z,\zeta)$. We have

$V(v,u) - V(z,u) = [\omega(u) - \langle\omega'(v), u - v\rangle - \omega(v)] - [\omega(u) - \langle\omega'(z), u - z\rangle - \omega(z)]$
$\qquad = \langle\omega'(v) - \omega'(z) + \zeta, v - u\rangle + \langle\zeta, u - v\rangle - [\omega(v) - \langle\omega'(z), v - z\rangle - \omega(z)]$
$\qquad\le\langle\zeta, u - v\rangle - V(z,v)$    (due to (5.4)),

as required in (a) of (5.3). The bound (b) of (5.3) is obtained from (a) of (5.3) using the Young inequality:

$\langle\zeta, z - v\rangle\le\frac{\|\zeta\|_*^2}{2\alpha} + \frac{\alpha}{2}\|z - v\|^2.$

Indeed, observe that by definition, $V(z,\cdot)$ is strongly convex with parameter $\alpha$, and $V(z,v)\ge\frac{\alpha}{2}\|z - v\|^2$, so that

$\langle\zeta, u - v\rangle - V(z,v) = \langle\zeta, u - z\rangle + \langle\zeta, z - v\rangle - V(z,v)\le\langle\zeta, u - z\rangle + \frac{\|\zeta\|_*^2}{2\alpha}.$

□ 
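Lemma 5.1 can be checked numerically in a concrete setup. The sketch below assumes the entropy setup on the standard simplex (distance-generating function $\sum_i x_i\ln x_i$, norm $\|\cdot\|_1$, $\alpha = 1$) and the prox-mapping $P(z,\xi) = \mathrm{argmin}_{u\in Z}[\langle\xi,u\rangle + V(z,u)]$, which has a closed form in this case; these choices and all function names are illustrative assumptions.

```python
import math
import random

def prox_entropy(z, xi):
    """P(z, xi) on the simplex for the entropy setup: z_i * exp(-xi_i), normalized."""
    w = [zi * math.exp(-g) for zi, g in zip(z, xi)]
    s = sum(w)
    return [v / s for v in w]

def bregman_kl(z, u):
    """V(z, u) = sum_i u_i * ln(u_i / z_i) for strictly positive z."""
    return sum(ui * math.log(ui / zi) for zi, ui in zip(z, u) if ui > 0)

def inner(a, b):
    return sum(p * q for p, q in zip(a, b))

def random_simplex_point(n, rng):
    w = [rng.random() + 1e-3 for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]

rng = random.Random(0)
n = 5
z = random_simplex_point(n, rng)
u = random_simplex_point(n, rng)
zeta = [rng.uniform(-1, 1) for _ in range(n)]
eta = [rng.uniform(-1, 1) for _ in range(n)]
v, w = prox_entropy(z, zeta), prox_entropy(z, eta)

# (5.2): ||P(z,zeta) - P(z,eta)||_1 versus ||zeta - eta||_inf (alpha = 1)
lhs_52 = sum(abs(a - b) for a, b in zip(v, w))
rhs_52 = max(abs(a - b) for a, b in zip(zeta, eta))
# (5.3): V(v,u) versus the right hand sides of (a) and (b)
rhs_53a = bregman_kl(z, u) + inner(zeta, [ui - vi for ui, vi in zip(u, v)]) - bregman_kl(z, v)
rhs_53b = bregman_kl(z, u) + inner(zeta, [ui - zi for ui, zi in zip(u, z)]) \
    + max(abs(g) for g in zeta) ** 2 / 2.0
```

In this particular setup the inequality (a) of (5.3) holds as an equality, so the check of (a) is a test of the closed-form prox-mapping rather than of the slack in the bound.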



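We have the following simple corollary of Lemma 5.1.

Corollary 5.2. Let $\xi_1, \xi_2, \dots$ be a sequence of elements of $\mathcal{E}$. Define the sequence $\{y_\tau\}_{\tau=0}^\infty$ in $Z^o$ as follows:

$y_\tau = P(y_{\tau-1},\xi_\tau),\quad y_0\in Z^o.$

Then $y_\tau$ is a measurable function of $y_0$ and $\xi_1,\dots,\xi_\tau$ such that

(5.6)    $(\forall u\in Z):\ \left\langle-\sum_{\tau=1}^t\xi_\tau, u\right\rangle\le V(y_0,u) + \sum_{\tau=1}^t\zeta_\tau,$

with

(5.7)    $|\zeta_\tau|\le r\|\xi_\tau\|_*$ (here $r = \max_{u\in Z}\|u\|$);$\qquad\zeta_\tau\le-\langle\xi_\tau, y_{\tau-1}\rangle + \frac{\|\xi_\tau\|_*^2}{2\alpha}.$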

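Proof. Using the bound (a) of (5.3) with $\zeta = \xi_\tau$ and $z = y_{\tau-1}$ (so that $y_\tau = P(y_{\tau-1},\xi_\tau)$) we obtain for any $u\in Z$:

$V(y_\tau,u) - V(y_{\tau-1},u) - \langle\xi_\tau, u\rangle\le-\langle\xi_\tau, y_\tau\rangle - V(y_{\tau-1},y_\tau)\equiv\zeta_\tau.$

Note that

$\zeta_\tau = \max_{v\in Z}[-\langle\xi_\tau, v\rangle - V(y_{\tau-1},v)],$

so that

$-r\|\xi_\tau\|_*\le-\langle\xi_\tau, y_{\tau-1}\rangle\le\zeta_\tau\le r\|\xi_\tau\|_*.$

Further, due to the strong convexity of $V$,

$\zeta_\tau = -\langle\xi_\tau, y_{\tau-1}\rangle + [-\langle\xi_\tau, y_\tau - y_{\tau-1}\rangle - V(y_{\tau-1},y_\tau)]\le-\langle\xi_\tau, y_{\tau-1}\rangle + \frac{\|\xi_\tau\|_*^2}{2\alpha}.$

When summing up from $\tau = 1$ to $\tau = t$ we arrive at the corollary. □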
We also need the following result. 

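Lemma 5.3. Let $z\in Z^o$, let $\zeta$, $\eta$ be two points from $\mathcal{E}$, and let

$w = P(z,\zeta),\quad r_+ = P(z,\eta).$

Then for all $u\in Z$ one has

(5.8)    (a) $\|w - r_+\|\le\alpha^{-1}\|\zeta - \eta\|_*$
         (b) $V(r_+,u) - V(z,u)\le\langle\eta, u - w\rangle + \frac{\|\zeta - \eta\|_*^2}{2\alpha} - \frac{\alpha}{2}\|w - z\|^2.$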



Proof. (a): this is nothing but (5.2).

(b): Using (a) of (5.3) in Lemma 5.1 we can write for $u = r_+$:

$V(w,r_+)\le V(z,r_+) + \langle\zeta, r_+ - w\rangle - V(z,w).$

This results in

(5.9)    $V(z,r_+)\ge V(w,r_+) + V(z,w) + \langle\zeta, w - r_+\rangle.$

Using (5.3) with $\eta$ substituted for $\zeta$ we get

$V(r_+,u)\le V(z,u) + \langle\eta, u - r_+\rangle - V(z,r_+)$
$\qquad = V(z,u) + \langle\eta, u - w\rangle + \langle\eta, w - r_+\rangle - V(z,r_+)$
$\qquad\le V(z,u) + \langle\eta, u - w\rangle + \langle\eta - \zeta, w - r_+\rangle - V(z,w) - V(w,r_+)$    [by (5.9)]
$\qquad\le V(z,u) + \langle\eta, u - w\rangle + \langle\eta - \zeta, w - r_+\rangle - \frac{\alpha}{2}[\|w - z\|^2 + \|w - r_+\|^2],$

due to the strong convexity of $V$. To conclude the bound (b) of (5.8) it suffices to note that by the Young inequality,

$\langle\eta - \zeta, w - r_+\rangle\le\frac{\|\eta - \zeta\|_*^2}{2\alpha} + \frac{\alpha}{2}\|w - r_+\|^2.$

□ 

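We are able now to prove Theorem 3.2. By (1.4) we have that

(5.10)    $\|\widehat{F}(w_\tau) - \widehat{F}(r_{\tau-1})\|_*^2\le(L\|r_{\tau-1} - w_\tau\| + M + \epsilon_{r_{\tau-1}} + \epsilon_{w_\tau})^2\le 3L^2\|w_\tau - r_{\tau-1}\|^2 + 3M^2 + 3(\epsilon_{r_{\tau-1}} + \epsilon_{w_\tau})^2.$

Let us now apply Lemma 5.3 with $z = r_{\tau-1}$, $\zeta = \gamma_\tau\widehat{F}(r_{\tau-1})$, $\eta = \gamma_\tau\widehat{F}(w_\tau)$ (so that $w = w_\tau$ and $r_+ = r_\tau$). We have for any $u\in Z$

$\langle\gamma_\tau\widehat{F}(w_\tau), w_\tau - u\rangle + V(r_\tau,u) - V(r_{\tau-1},u)$
$\qquad\le\frac{\gamma_\tau^2}{2\alpha}\|\widehat{F}(w_\tau) - \widehat{F}(r_{\tau-1})\|_*^2 - \frac{\alpha}{2}\|w_\tau - r_{\tau-1}\|^2$
$\qquad\le\frac{3\gamma_\tau^2}{2\alpha}\left[L^2\|w_\tau - r_{\tau-1}\|^2 + M^2 + (\epsilon_{r_{\tau-1}} + \epsilon_{w_\tau})^2\right] - \frac{\alpha}{2}\|w_\tau - r_{\tau-1}\|^2$    [by (5.10)]
$\qquad\le\frac{3\gamma_\tau^2}{2\alpha}\left[M^2 + (\epsilon_{r_{\tau-1}} + \epsilon_{w_\tau})^2\right]$    [since $\gamma_\tau\le\alpha/(\sqrt{3}L)$].

When summing up from $\tau = 1$ to $\tau = t$ we obtain

$\sum_{\tau=1}^t\langle\gamma_\tau\widehat{F}(w_\tau), w_\tau - u\rangle\le V(r_0,u) - V(r_t,u) + \sum_{\tau=1}^t\frac{3\gamma_\tau^2}{2\alpha}\left[M^2 + (\epsilon_{r_{\tau-1}} + \epsilon_{w_\tau})^2\right].$

Hence, for all $u\in Z$,

(5.11)    $\sum_{\tau=1}^t\langle\gamma_\tau F(w_\tau), w_\tau - u\rangle\le\Theta(r_0) + \sum_{\tau=1}^t\frac{3\gamma_\tau^2}{2\alpha}\left[M^2 + (\epsilon_{r_{\tau-1}} + \epsilon_{w_\tau})^2\right] + \sum_{\tau=1}^t\langle\gamma_\tau\Delta_\tau, w_\tau - u\rangle$
$\qquad = \Theta(r_0) + \sum_{\tau=1}^t\frac{3\gamma_\tau^2}{2\alpha}\left[M^2 + (\epsilon_{r_{\tau-1}} + \epsilon_{w_\tau})^2\right] + \sum_{\tau=1}^t\langle\gamma_\tau\Delta_\tau, w_\tau - y_{\tau-1}\rangle + \sum_{\tau=1}^t\langle\gamma_\tau\Delta_\tau, y_{\tau-1} - u\rangle,$

where $y_\tau$ are given by (3.3). Since the sequences $\{y_\tau\}$, $\{\xi_\tau = \gamma_\tau\Delta_\tau\}$ satisfy the premise of Corollary 5.2, we have

$(\forall u\in Z):\ \sum_{\tau=1}^t\langle\gamma_\tau\Delta_\tau, y_{\tau-1} - u\rangle\le V(r_0,u) + \sum_{\tau=1}^t\frac{\gamma_\tau^2}{2\alpha}\|\Delta_\tau\|_*^2\le\Theta(r_0) + \sum_{\tau=1}^t\frac{\gamma_\tau^2}{2\alpha}\|\Delta_\tau\|_*^2,$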

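and thus (5.11) implies that for any $u\in Z$

(5.12)    $\sum_{\tau=1}^t\langle\gamma_\tau F(w_\tau), w_\tau - u\rangle\le 2\Theta(r_0) + \sum_{\tau=1}^t\frac{3\gamma_\tau^2}{2\alpha}\left[M^2 + (\epsilon_{r_{\tau-1}} + \epsilon_{w_\tau})^2 + \frac{\|\Delta_\tau\|_*^2}{3}\right] + \sum_{\tau=1}^t\langle\gamma_\tau\Delta_\tau, w_\tau - y_{\tau-1}\rangle.$

To complete the proof of (3.5) in the general case, note that since $F$ is monotone, (5.12) implies that for all $u\in Z$,

$\sum_{\tau=1}^t\gamma_\tau\langle F(u), w_\tau - u\rangle\le\Gamma(t),$

where

$\Gamma(t) = 2\Theta(r_0) + \sum_{\tau=1}^t\frac{3\gamma_\tau^2}{2\alpha}\left[M^2 + (\epsilon_{r_{\tau-1}} + \epsilon_{w_\tau})^2 + \frac{\|\Delta_\tau\|_*^2}{3}\right] + \sum_{\tau=1}^t\langle\gamma_\tau\Delta_\tau, w_\tau - y_{\tau-1}\rangle$

(cf. (3.6)), whence

$\forall(u\in Z):\ \langle F(u), z_t - u\rangle\le\left[\sum_{\tau=1}^t\gamma_\tau\right]^{-1}\Gamma(t).$

When taking the supremum over $u\in Z$, we arrive at (3.5).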

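In the case of a Nash v.i., setting $w_\tau = (w_{\tau,1},\dots,w_{\tau,m})$ and $u = (u_1,\dots,u_m)$ and recalling the origin of $F$, due to the convexity of $\phi_i(z_i,z^i)$ in $z_i$, for all $u\in Z$ we get from (5.12):

$\sum_{\tau=1}^t\gamma_\tau\sum_{i=1}^m\left[\phi_i(w_\tau) - \phi_i(u_i,(w_\tau)^i)\right]\le\Gamma(t).$

Setting $\phi(z) = \sum_{i=1}^m\phi_i(z)$, we get

$\sum_{\tau=1}^t\gamma_\tau\left[\phi(w_\tau) - \sum_{i=1}^m\phi_i(u_i,(w_\tau)^i)\right]\le\Gamma(t).$

Recalling that $\phi(\cdot)$ is convex and $\phi_i(u_i,\cdot)$ are concave, $i = 1,\dots,m$, the latter inequality implies that

$\left[\sum_{\tau=1}^t\gamma_\tau\right]\left[\phi(z_t) - \sum_{i=1}^m\phi_i(u_i,(z_t)^i)\right]\le\Gamma(t),$

or, which is the same,

$\sum_{i=1}^m\left[\phi_i(z_t) - \phi_i(u_i,(z_t)^i)\right]\le\left[\sum_{\tau=1}^t\gamma_\tau\right]^{-1}\Gamma(t).$

This relation holds true for all $u = (u_1,\dots,u_m)\in Z$; taking maximum of both sides in $u$, we get

$\mathrm{Err}_{\mathrm{N}}(z_t)\le\left[\sum_{\tau=1}^t\gamma_\tau\right]^{-1}\Gamma(t).$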
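The recursion analyzed in this proof (two prox-steps per iteration, the first with the oracle value at $r_{\tau-1}$ and the second with the oracle value at $w_\tau$, followed by averaging of the points $w_\tau$) can be sketched on a toy problem. Everything specific below is an assumption made for illustration: the problem is a $3\times 3$ matrix game $\min_x\max_y x^TAy$ on simplices, both blocks use the entropy prox-mapping, the noise is Gaussian, and the names `smp`, `noisy_operator` and `saddle_gap` are hypothetical.

```python
import math
import random

def prox(z, xi):
    """Entropy prox-mapping on the simplex: P(z, xi)_i is proportional to z_i * exp(-xi_i)."""
    w = [zi * math.exp(-g) for zi, g in zip(z, xi)]
    s = sum(w)
    return [v / s for v in w]

def noisy_operator(A, x, y, sigma, rng):
    """Noisy observation of F(x, y) = [A y; -A^T x] for min_x max_y x^T A y."""
    Fx = [sum(A[i][j] * y[j] for j in range(len(y))) + sigma * rng.gauss(0, 1)
          for i in range(len(x))]
    Fy = [-sum(A[i][j] * x[i] for i in range(len(x))) + sigma * rng.gauss(0, 1)
          for j in range(len(y))]
    return Fx, Fy

def smp(A, x0, y0, gamma, t, sigma=0.0, seed=0):
    """t steps with constant stepsize gamma; returns the average of the points w_tau."""
    rng = random.Random(seed)
    rx, ry = list(x0), list(y0)
    avg_x, avg_y = [0.0] * len(x0), [0.0] * len(y0)
    for _ in range(t):
        gx, gy = noisy_operator(A, rx, ry, sigma, rng)      # oracle at r_{tau-1}
        wx = prox(rx, [gamma * g for g in gx])
        wy = prox(ry, [gamma * g for g in gy])
        hx, hy = noisy_operator(A, wx, wy, sigma, rng)      # oracle at w_tau
        rx = prox(rx, [gamma * g for g in hx])
        ry = prox(ry, [gamma * g for g in hy])
        avg_x = [a + w / t for a, w in zip(avg_x, wx)]
        avg_y = [a + w / t for a, w in zip(avg_y, wy)]
    return avg_x, avg_y

def saddle_gap(A, x, y):
    """max_y' x^T A y' - min_x' x'^T A y, the Nash error of the pair (x, y)."""
    col = [sum(A[i][j] * x[i] for i in range(len(x))) for j in range(len(y))]
    row = [sum(A[i][j] * y[j] for j in range(len(y))) for i in range(len(x))]
    return max(col) - min(row)

A = [[0.0, 1.0, -1.0], [-1.0, 0.0, 1.0], [1.0, -1.0, 0.0]]
x_avg, y_avg = smp(A, [0.7, 0.2, 0.1], [0.1, 0.2, 0.7], gamma=0.1, t=4000)
gap = saddle_gap(A, x_avg, y_avg)
```

With `sigma=0.0` the sketch reduces to the deterministic Mirror Prox recursion; passing a positive `sigma` gives a noisy run whose accuracy is governed by the stochastic terms of $\Gamma(t)$.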



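5.2. Proof of Theorem 3.5. In what follows, we use the notation from Theorem 3.2. By this theorem, in the case of constant stepsizes $\gamma_\tau = \gamma$ we have

(5.13)    $\mathrm{Err}_{\mathrm{vi}}(z_t)\le[t\gamma]^{-1}\Gamma(t),$

where

(5.14)    $\Gamma(t) = 2\Theta(r_0) + \frac{3\gamma^2}{2\alpha}\sum_{\tau=1}^t\left[M^2 + (\epsilon_{r_{\tau-1}} + \epsilon_{w_\tau})^2 + \frac{\|\Delta_\tau\|_*^2}{3}\right] + \gamma\sum_{\tau=1}^t\langle\Delta_\tau, w_\tau - y_{\tau-1}\rangle$
$\qquad\le 2\Theta + \frac{7\gamma^2}{2\alpha}\sum_{\tau=1}^t\left[M^2 + \epsilon_{r_{\tau-1}}^2 + \epsilon_{w_\tau}^2\right] + \gamma\sum_{\tau=1}^t\langle\Delta_\tau, w_\tau - y_{\tau-1}\rangle.$

For a Nash v.i., $\mathrm{Err}_{\mathrm{vi}}$ in this relation can be replaced with $\mathrm{Err}_{\mathrm{N}}$.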

Note that by description of the algorithm $r_{\tau-1}$ is a deterministic function of $\zeta^{N(\tau-1)}$ and $w_\tau$ is a deterministic function of $\zeta^{M(\tau)}$ for certain increasing sequences of integers $\{M(\tau)\}$, $\{N(\tau)\}$ such that $N(\tau-1) < M(\tau) < N(\tau)$. Therefore $\epsilon_{r_{\tau-1}}$ is a deterministic function of $\zeta^{N(\tau-1)+1}$, and $\epsilon_{w_\tau}$ and $\Delta_\tau$ are deterministic functions of $\zeta^{M(\tau)+1}$. Denoting by $\mathbf{E}_i$ the expectation w.r.t. $\zeta_i$, we conclude that under assumption I we have

(5.15)    $\mathbf{E}_{N(\tau-1)+1}\{\epsilon_{r_{\tau-1}}^2\}\le M^2,\qquad\mathbf{E}_{M(\tau)+1}\{\epsilon_{w_\tau}^2\}\le M^2,\qquad\|\mathbf{E}_{M(\tau)+1}\{\Delta_\tau\}\|_*\le\mu,$

and under assumption II, in addition,

(5.16)    $\mathbf{E}_{N(\tau-1)+1}\{\exp\{\epsilon_{r_{\tau-1}}^2M^{-2}\}\}\le\exp\{1\},\qquad\mathbf{E}_{M(\tau)+1}\{\exp\{\epsilon_{w_\tau}^2M^{-2}\}\}\le\exp\{1\}.$

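Now, let

$\Gamma_0(t) = \frac{7\gamma^2}{2\alpha}\sum_{\tau=1}^t\left[M^2 + \epsilon_{r_{\tau-1}}^2 + \epsilon_{w_\tau}^2\right].$

We conclude by (5.15) that

(5.17)    $\mathbf{E}\{\Gamma_0(t)\}\le\frac{21\gamma^2M^2t}{2\alpha}.$

Further, $y_{\tau-1}$ clearly is a deterministic function of $\zeta^{M(\tau-1)+1}$, whence $w_\tau - y_{\tau-1}$ is a deterministic function of $\zeta^{M(\tau)}$. Therefore

(5.18)    $\mathbf{E}_{M(\tau)+1}\{\langle\Delta_\tau, w_\tau - y_{\tau-1}\rangle\} = \langle\mathbf{E}_{M(\tau)+1}\{\Delta_\tau\}, w_\tau - y_{\tau-1}\rangle\le\mu\|w_\tau - y_{\tau-1}\|\le 2\mu\Omega,$

where the concluding inequality follows from the fact that $Z$ is contained in the $\|\cdot\|$-ball of radius $\Omega = \sqrt{2\Theta/\alpha}$ centered at $z_c$, see (2.5). From (5.18) it follows that

$\mathbf{E}\left\{\gamma\sum_{\tau=1}^t\langle\Delta_\tau, w_\tau - y_{\tau-1}\rangle\right\}\le 2\mu\gamma t\Omega.$

Combining the latter relation, (5.13), (5.14) and (5.17), we arrive at (3.9). (i) is proved.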

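To prove (ii), observe, first, that setting

$J_t = \sum_{\tau=1}^tM^{-2}\left[\epsilon_{r_{\tau-1}}^2 + \epsilon_{w_\tau}^2\right],$

we get

(5.19)    $\Gamma_0(t) = \frac{7\gamma^2M^2}{2\alpha}[t + J_t].$

At the same time, we can write

$J_t = \sum_{j=1}^{2t}\xi_j,$

where $\xi_j\ge 0$ is a deterministic function of $\zeta^{I(j)}$ for certain increasing sequence of integers $\{I(j)\}$. Moreover, when denoting by $\mathbf{E}_j$ conditional expectation over $\zeta_{I(j)}, \zeta_{I(j)+1}, \dots$, $\zeta^{I(j)-1}$ being fixed, we have

$\mathbf{E}_j\{\exp\{\xi_j\}\}\le\exp\{1\},$

see (5.16). It follows that

(5.20)    $\mathbf{E}\left\{\exp\left\{\sum_{j=1}^{k+1}\xi_j\right\}\right\} = \mathbf{E}\left\{\mathbf{E}_{k+1}\left\{\exp\left\{\sum_{j=1}^k\xi_j\right\}\exp\{\xi_{k+1}\}\right\}\right\} = \mathbf{E}\left\{\exp\left\{\sum_{j=1}^k\xi_j\right\}\mathbf{E}_{k+1}\{\exp\{\xi_{k+1}\}\}\right\}\le\exp\{1\}\mathbf{E}\left\{\exp\left\{\sum_{j=1}^k\xi_j\right\}\right\}.$

Whence $\mathbf{E}[\exp\{J_t\}]\le\exp\{2t\}$, and applying the Tchebychev inequality, we get

$\forall\Lambda > 0:\ \mathrm{Prob}\{J_t > 2t + \Lambda t\}\le\exp\{-\Lambda t\}.$

Along with (5.19) it implies that

(5.21)    $\forall\Lambda\ge 0:\ \mathrm{Prob}\left\{\Gamma_0(t) > \frac{21\gamma^2M^2t}{2\alpha} + \Lambda\frac{7\gamma^2M^2t}{2\alpha}\right\}\le\exp\{-\Lambda t\}.$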

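Let now $\xi_\tau = \langle\Delta_\tau, w_\tau - y_{\tau-1}\rangle$. Recall that $w_\tau - y_{\tau-1}$ is a deterministic function of $\zeta^{M(\tau)}$. Besides this, we have seen that $\|w_\tau - y_{\tau-1}\|\le D\equiv 2\Omega$. Taking into account (5.15), (5.16), we get

(5.22)    (a) $\mathbf{E}_{M(\tau)+1}\{\xi_\tau\}\le\rho\equiv\mu D,$
          (b) $\mathbf{E}_{M(\tau)+1}\{\exp\{\xi_\tau^2R^{-2}\}\}\le\exp\{1\}$, with $R = MD$.

Observe that $\exp\{x\}\le x + \exp\{9x^2/16\}$ for all $x$. Thus (5.22.b) implies for $0\le s\le\frac{4}{3R}$

(5.23)    $\mathbf{E}_{M(\tau)+1}\{\exp\{s\xi_\tau\}\}\le s\rho + \exp\{9s^2R^2/16\}\le\exp\{s\rho + 9s^2R^2/16\}.$

Further, we have $s\xi_\tau\le\frac{3}{8}s^2R^2 + \frac{2}{3}\xi_\tau^2R^{-2}$, hence for all $s\ge 0$,

$\mathbf{E}_{M(\tau)+1}\{\exp\{s\xi_\tau\}\}\le\exp\{3s^2R^2/8\}\,\mathbf{E}_{M(\tau)+1}\left\{\exp\left\{\frac{2\xi_\tau^2}{3R^2}\right\}\right\}\le\exp\left\{\frac{3s^2R^2}{8} + \frac{2}{3}\right\}.$

When $s\ge\frac{4}{3R}$, the latter quantity is $\le\exp\{3s^2R^2/4\}$, which combines with (5.23) to imply that for $s\ge 0$,

(5.24)    $\mathbf{E}_{M(\tau)+1}\{\exp\{s\xi_\tau\}\}\le\exp\{s\rho + 3s^2R^2/4\}.$

Acting as in (5.20), we derive from (5.24) that

$s\ge 0\ \Rightarrow\ \mathbf{E}\left\{\exp\left\{s\sum_{\tau=1}^t\xi_\tau\right\}\right\}\le\exp\{st\rho + 3s^2tR^2/4\},$

and by the Tchebychev inequality, for all $\Lambda > 0$,

$\mathrm{Prob}\left\{\sum_{\tau=1}^t\xi_\tau > t\rho + \Lambda R\sqrt{t}\right\}\le\inf_{s\ge 0}\exp\{3s^2tR^2/4 - s\Lambda R\sqrt{t}\} = \exp\{-\Lambda^2/3\}.$

Finally, we arrive at

(5.25)    $\mathrm{Prob}\left\{\gamma\sum_{\tau=1}^t\langle\Delta_\tau, w_\tau - y_{\tau-1}\rangle > 2\gamma\mu\Omega t + 2\Lambda\gamma M\Omega\sqrt{t}\right\}\le\exp\{-\Lambda^2/3\}$

for all $\Lambda > 0$. Combining (5.13), (5.14), (5.21) and (5.25), we get (3.10). □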
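Two elementary facts used in the argument above lend themselves to a quick numerical check: the scalar inequality $\exp\{x\}\le x + \exp\{9x^2/16\}$, and the value $-\Lambda^2/3$ of the minimum of $3s^2tR^2/4 - s\Lambda R\sqrt{t}$ over $s$. The sketch below tests the former on a grid and evaluates the latter at its stationary point; the grid, the sample values of $t$, $R$, $\Lambda$ and the function names are arbitrary illustrative choices.

```python
import math

def elementary_bound_holds(x):
    """Check exp(x) <= x + exp(9 x^2 / 16) up to a tiny rounding tolerance."""
    return math.exp(x) <= x + math.exp(9.0 * x * x / 16.0) + 1e-12

def chernoff_exponent(s, t, R, lam):
    """The exponent 3 s^2 t R^2 / 4 - s * lam * R * sqrt(t)."""
    return 0.75 * s * s * t * R * R - s * lam * R * math.sqrt(t)

grid_ok = all(elementary_bound_holds(-10.0 + 0.01 * i) for i in range(2001))

t, R, lam = 50, 2.0, 1.5                        # hypothetical values
s_star = 2.0 * lam / (3.0 * R * math.sqrt(t))   # stationary point of the quadratic in s
best = chernoff_exponent(s_star, t, R, lam)     # should equal -lam^2 / 3
```

Since the exponent is a convex quadratic in $s$ with a positive stationary point, its infimum over $s\ge 0$ is attained at `s_star`.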
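5.3. Proof of Lemma 4.1.

Proof of (i). We clearly have $Z^o = X^o\times Y^o$, and $\omega(\cdot)$ is indeed continuously differentiable on this set. Let $z = (x,y)$ and $z' = (x',y')$, $z, z'\in Z^o$. Then

$\langle\omega'(z) - \omega'(z'), z - z'\rangle = \frac{1}{\alpha_x\Omega_x^2}\langle\omega'_x(x) - \omega'_x(x'), x - x'\rangle + \frac{1}{\alpha_y\Omega_y^2}\langle\omega'_y(y) - \omega'_y(y'), y - y'\rangle$
$\qquad\ge\frac{1}{\Omega_x^2}\|x - x'\|_x^2 + \frac{1}{\Omega_y^2}\|y - y'\|_y^2\ge\|[x' - x; y' - y]\|^2.$

Thus, $\omega(\cdot)$ is strongly convex on $Z$, modulus $\alpha = 1$, w.r.t. the norm $\|\cdot\|$. Further, the minimizer of $\omega(\cdot)$ on $Z$ clearly is $z_c = (x_c, y_c)$, and

$\Theta = \frac{1}{\alpha_x\Omega_x^2}\Theta_x + \frac{1}{\alpha_y\Omega_y^2}\Theta_y = 1,$

so that $\Theta = 1$, whence $\Omega = \sqrt{2\Theta/\alpha} = \sqrt{2}$.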
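The normalization behind item (i) can be illustrated with a few lines of code. The sketch assumes that the combined setup weights the partial distance-generating functions by $1/(\alpha_x\Omega_x^2)$ and $1/(\alpha_y\Omega_y^2)$, with $\Omega = \sqrt{2\Theta/\alpha}$ for each partial setup, and that the combined norm and its conjugate are the weighted Euclidean combinations shown in the code; the function names and the sample parameters are hypothetical.

```python
import math

def omega_radius(alpha, theta):
    """Omega = sqrt(2 * Theta / alpha)."""
    return math.sqrt(2.0 * theta / alpha)

def joint_norm(nx, ny, om_x, om_y):
    """||(x, y)|| computed from the partial norms nx = ||x||_x, ny = ||y||_y."""
    return math.sqrt(nx ** 2 / om_x ** 2 + ny ** 2 / om_y ** 2)

def joint_dual_norm(dx, dy, om_x, om_y):
    """||(xi, eta)||_* computed from the partial conjugate norms dx, dy."""
    return math.sqrt(om_x ** 2 * dx ** 2 + om_y ** 2 * dy ** 2)

def joint_theta(alpha_x, theta_x, alpha_y, theta_y):
    """Theta of the combined setup under the assumed weighting."""
    om_x = omega_radius(alpha_x, theta_x)
    om_y = omega_radius(alpha_y, theta_y)
    return theta_x / (alpha_x * om_x ** 2) + theta_y / (alpha_y * om_y ** 2)

# Hypothetical partial setups: Theta_x = ln(n), Theta_y = 2 ln(p), alpha_x = alpha_y = 1.
n, p = 100, 30
theta = joint_theta(1.0, math.log(n), 1.0, 2.0 * math.log(p))
omega = omega_radius(1.0, theta)
```

Whatever the partial parameters are, each of the two terms in `joint_theta` equals 1/2, which is why the combined setup always has $\Theta = 1$ and $\Omega = \sqrt{2}$.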

Proof of (ii). 1°. Let $z = (x,y)$ and $z' = (x',y')$ with $z, z'\in Z$. Observe that $\|y - y'\|_y\le 2\Omega_y$ and thus

(5.26)    $\|y'\|_y\le 2\Omega_y,$

due to $0\in Y$.

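On the other hand, we have from (4.9) $F(z') - F(z) = [\Delta_x;\Delta_y]$, where

$\Delta_x = \sum_{\ell=1}^m[\phi'_\ell(x') - \phi'_\ell(x)]^*[A_\ell y' + b_\ell] + \sum_{\ell=1}^m[\phi'_\ell(x)]^*A_\ell[y' - y],$
$\Delta_y = -\sum_{\ell=1}^mA_\ell^*[\phi_\ell(x') - \phi_\ell(x)] + \Phi'_*(y') - \Phi'_*(y).$

We have

$\|\Delta_x\|_{x,*} = \max_{h\in\mathcal{X},\ \|h\|_x\le 1}\left\langle h, \sum_{\ell=1}^m\left[[\phi'_\ell(x') - \phi'_\ell(x)]^*[A_\ell y' + b_\ell] + [\phi'_\ell(x)]^*A_\ell[y' - y]\right]\right\rangle_{\mathcal{X}}$
$\qquad\le\sum_{\ell=1}^m\left[\max_{h\in\mathcal{X},\ \|h\|_x\le 1}\langle h, [\phi'_\ell(x') - \phi'_\ell(x)]^*[A_\ell y' + b_\ell]\rangle_{\mathcal{X}} + \max_{h\in\mathcal{X},\ \|h\|_x\le 1}\langle h, [\phi'_\ell(x)]^*A_\ell[y' - y]\rangle_{\mathcal{X}}\right]$
$\qquad = \sum_{\ell=1}^m\left[\max_{h\in\mathcal{X},\ \|h\|_x\le 1}\langle[\phi'_\ell(x') - \phi'_\ell(x)]h, A_\ell y' + b_\ell\rangle_{E_\ell} + \max_{h\in\mathcal{X},\ \|h\|_x\le 1}\langle\phi'_\ell(x)h, A_\ell[y' - y]\rangle_{E_\ell}\right]$
$\qquad\le\sum_{\ell=1}^m\left[\max_{h\in\mathcal{X},\ \|h\|_x\le 1}\|[\phi'_\ell(x') - \phi'_\ell(x)]h\|_{(\ell)}\|A_\ell y' + b_\ell\|_{(\ell,*)} + \max_{h\in\mathcal{X},\ \|h\|_x\le 1}\|\phi'_\ell(x)h\|_{(\ell)}\|A_\ell[y' - y]\|_{(\ell,*)}\right].$

Then by (4.2),

$\|\Delta_x\|_{x,*}\le\sum_{\ell=1}^m[L_x\|x - x'\|_x + M_x]\left[\|A_\ell y'\|_{(\ell,*)} + \|b_\ell\|_{(\ell,*)}\right] + \sum_{\ell=1}^m[L_x\Omega_x + M_x]\|A_\ell[y - y']\|_{(\ell,*)}$
$\qquad = [L_x\|x - x'\|_x + M_x]\sum_{\ell=1}^m\left[\|A_\ell y'\|_{(\ell,*)} + \|b_\ell\|_{(\ell,*)}\right] + [L_x\Omega_x + M_x]\sum_{\ell=1}^m\|A_\ell[y - y']\|_{(\ell,*)}$
$\qquad\le[L_x\|x - x'\|_x + M_x][A\|y'\|_y + B] + [L_x\Omega_x + M_x]A\|y - y'\|_y,$

by definition of $A$ and $B$. Next, due to (5.26) we get by definition of $\|\cdot\|$

$\|\Delta_x\|_{x,*}\le[L_x\|x - x'\|_x + M_x][2A\Omega_y + B] + [L_x\Omega_x + M_x]A\|y - y'\|_y$
$\qquad\le[L_x\Omega_x\|z - z'\| + M_x][2A\Omega_y + B] + [L_x\Omega_x + M_x]A\Omega_y\|z - z'\|,$

which implies

(a): $\|\Delta_x\|_{x,*}\le\left[\Omega_x[2A\Omega_y + B]L_x + 2A\Omega_y[L_x\Omega_x + M_x]\right]\|z - z'\| + [2A\Omega_y + B]M_x.$

Further,

$\|\Delta_y\|_{y,*} = \max_{\eta\in\mathcal{Y},\ \|\eta\|_y\le 1}\left\langle\eta, -\sum_{\ell=1}^mA_\ell^*[\phi_\ell(x') - \phi_\ell(x)] + \Phi'_*(y') - \Phi'_*(y)\right\rangle_{\mathcal{Y}}$
$\qquad\le\max_{\eta\in\mathcal{Y},\ \|\eta\|_y\le 1}\sum_{\ell=1}^m\langle\eta, A_\ell^*[\phi_\ell(x) - \phi_\ell(x')]\rangle_{\mathcal{Y}} + \|\Phi'_*(y') - \Phi'_*(y)\|_{y,*}$
$\qquad = \max_{\eta\in\mathcal{Y},\ \|\eta\|_y\le 1}\sum_{\ell=1}^m\langle A_\ell\eta, \phi_\ell(x) - \phi_\ell(x')\rangle_{E_\ell} + \|\Phi'_*(y') - \Phi'_*(y)\|_{y,*}$
$\qquad\le\max_{\eta\in\mathcal{Y},\ \|\eta\|_y\le 1}\sum_{\ell=1}^m\|A_\ell\eta\|_{(\ell,*)}\|\phi_\ell(x) - \phi_\ell(x')\|_{(\ell)} + \|\Phi'_*(y') - \Phi'_*(y)\|_{y,*}$
$\qquad\le\max_{\eta\in\mathcal{Y},\ \|\eta\|_y\le 1}\sum_{\ell=1}^m\|A_\ell\eta\|_{(\ell,*)}[L_x\Omega_x + M_x]\|x - x'\|_x + [L_y\|y - y'\|_y + M_y],$

by (4.2.b) and (4.5). Now

$\|\Delta_y\|_{y,*}\le A[L_x\Omega_x + M_x]\|x - x'\|_x + [L_y\|y - y'\|_y + M_y],$

and we come to

(b): $\|\Delta_y\|_{y,*}\le\left[\Omega_xA[L_x\Omega_x + M_x] + \Omega_yL_y\right]\|z - z'\| + M_y.$

From (a) and (b) it follows that

$\|F(z) - F(z')\|_*\le\Omega_x\|\Delta_x\|_{x,*} + \Omega_y\|\Delta_y\|_{y,*}$
$\qquad\le\left[\Omega_x^2[2A\Omega_y + B]L_x + 3A\Omega_x\Omega_y[L_x\Omega_x + M_x] + L_y\Omega_y^2\right]\|z - z'\| + \Omega_x[2A\Omega_y + B]M_x + \Omega_yM_y.$

We have justified (4.12).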

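2°. Let us verify (4.13). The first relation in (4.13) is readily given by (4.3.a,c). Let us fix $z = (x,y)\in Z$ and $i$, and let

(5.27)    $\Delta = F(z) - \Xi(z,\zeta_i) = \left[\sum_{\ell=1}^m[\phi'_\ell(x) - G_\ell(x,\zeta_i)]^*[A_\ell y + b_\ell];\ -\sum_{\ell=1}^mA_\ell^*[\phi_\ell(x) - f_\ell(x,\zeta_i)]\right] = [\Delta_x;\Delta_y].$

As we have seen,

(5.28)    $\sum_{\ell=1}^m\|A_\ell y + b_\ell\|_{(\ell,*)}\le 2A\Omega_y + B.$

Besides this, for $u_\ell\in E_\ell$ we have

(5.29)    $\left\|\sum_{\ell=1}^mA_\ell^*u_\ell\right\|_{y,*} = \max_{\eta\in\mathcal{Y},\ \|\eta\|_y\le 1}\left\langle\sum_{\ell=1}^mA_\ell^*u_\ell, \eta\right\rangle_{\mathcal{Y}} = \max_{\eta\in\mathcal{Y},\ \|\eta\|_y\le 1}\sum_{\ell=1}^m\langle u_\ell, A_\ell\eta\rangle_{E_\ell}$
$\qquad\le\max_{\eta\in\mathcal{Y},\ \|\eta\|_y\le 1}\sum_{\ell=1}^m\|u_\ell\|_{(\ell)}\|A_\ell\eta\|_{(\ell,*)}\le\max_{\eta\in\mathcal{Y},\ \|\eta\|_y\le 1}\left[\max_{1\le\ell\le m}\|u_\ell\|_{(\ell)}\right]\sum_{\ell=1}^m\|A_\ell\eta\|_{(\ell,*)} = A\max_{1\le\ell\le m}\|u_\ell\|_{(\ell)}.$

Hence, setting $u_\ell = \phi_\ell(x) - f_\ell(x,\zeta_i)$ we obtain

(5.30)    $\|\Delta_y\|_{y,*} = \left\|\sum_{\ell=1}^mA_\ell^*[\phi_\ell(x) - f_\ell(x,\zeta_i)]\right\|_{y,*}\le A\xi,\qquad\xi = \xi(\zeta_i) = \max_{1\le\ell\le m}\|\phi_\ell(x) - f_\ell(x,\zeta_i)\|_{(\ell)}.$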



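Further,

$\|\Delta_x\|_{x,*} = \max_{h\in\mathcal{X},\ \|h\|_x\le 1}\left\langle h, \sum_{\ell=1}^m[\phi'_\ell(x) - G_\ell(x,\zeta_i)]^*[A_\ell y + b_\ell]\right\rangle_{\mathcal{X}}$
$\qquad = \max_{h\in\mathcal{X},\ \|h\|_x\le 1}\sum_{\ell=1}^m\langle[\phi'_\ell(x) - G_\ell(x,\zeta_i)]h, A_\ell y + b_\ell\rangle_{E_\ell}$
$\qquad\le\sum_{\ell=1}^m\max_{h\in\mathcal{X},\ \|h\|_x\le 1}\|[\phi'_\ell(x) - G_\ell(x,\zeta_i)]h\|_{(\ell)}\|A_\ell y + b_\ell\|_{(\ell,*)}.$

Invoking (5.28), we conclude that

(5.31)    $\|\Delta_x\|_{x,*}\le\sum_{\ell=1}^m\rho_\ell\xi_\ell,$

where all $\rho_\ell\ge 0$, $\sum_\ell\rho_\ell\le 2A\Omega_y + B$ and

$\xi_\ell = \xi_\ell(\zeta_i) = \max_{h\in\mathcal{X},\ \|h\|_x\le 1}\|[\phi'_\ell(x) - G_\ell(x,\zeta_i)]h\|_{(\ell)}.$

Denoting by $p^2(\eta)$ the second moment of a scalar random variable $\eta$, observe that $p(\cdot)$ is a norm on the space of square summable random variables representable as deterministic functions of $\zeta_i$, and that

$p(\xi)\le\Omega_xM_x,\qquad p(\xi_\ell)\le M_x,$

by (4.3.b,d). Now by (5.30), (5.31),

$[\mathbf{E}\{\|\Delta\|_*^2\}]^{1/2} = \left[\mathbf{E}\{\Omega_x^2\|\Delta_x\|_{x,*}^2 + \Omega_y^2\|\Delta_y\|_{y,*}^2\}\right]^{1/2}$
$\qquad\le p\left(\Omega_x\|\Delta_x\|_{x,*} + \Omega_y\|\Delta_y\|_{y,*}\right)\le p\left(\Omega_x\sum_{\ell=1}^m\rho_\ell\xi_\ell + \Omega_yA\xi\right)$
$\qquad\le\Omega_x\sum_{\ell=1}^m\rho_\ell\max_{1\le\ell\le m}p(\xi_\ell) + \Omega_yAp(\xi),$

and the latter quantity is $\le M$, see (4.12). We have established the second relation in (4.13).

3°. It remains to prove that in the case of (4.14), relation (4.15) takes place. To this end, one can repeat word by word the reasoning from item 2° with the function $p_e(\eta) = \inf\{t > 0 : \mathbf{E}\{\exp\{\eta^2/t^2\}\}\le\exp\{1\}\}$ in the role of $p(\eta)$. Note that similarly to $p(\cdot)$, $p_e(\cdot)$ is a norm on the space of random variables $\eta$ which are deterministic functions of $\zeta_i$ and are such that $p_e(\eta) < \infty$. □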



5.4. Proof of Lemma 4.3. Item (i) can be verified exactly as in the case of Lemma 4.1; the facts expressed in (i) depend solely on the construction from Section 4.2 preceding the latter Lemma, and are independent of what are the setups for $X$, $\mathcal{X}$ and $Y$, $\mathcal{Y}$.

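Let us verify item (ii). Note that we are in the situation

(5.32)    $\|(x,y)\| = \sqrt{\|x\|_1^2/(2\ln(n)) + |y|_1^2/(4\ln(p^{(1)}))},\qquad\|(\xi,\eta)\|_* = \sqrt{2\ln(n)\|\xi\|_\infty^2 + 4\ln(p^{(1)})|\eta|_\infty^2}.$

For $z = (x,y)$, $z' = (x',y')\in Z$ we have

$F(z) - F(z') = [\Delta_x;\Delta_y],\qquad\Delta_x = [\mathrm{Tr}((y - y')A_1);\dots;\mathrm{Tr}((y - y')A_n)],\qquad\Delta_y = -\sum_{j=1}^n(x_j - x'_j)A_j,$

whence

$\|\Delta_x\|_\infty\le|y - y'|_1\max_{1\le j\le n}|A_j|_\infty\le 2\sqrt{\ln(p^{(1)})}\,A_\infty\|z - z'\|,$

$|\Delta_y|_\infty\le\|x - x'\|_1\max_{1\le j\le n}|A_j|_\infty\le\sqrt{2\ln(n)}\,A_\infty\|z - z'\|,$

and

$\|(\Delta_x,\Delta_y)\|_*\le[2\ln(n) + 4\ln(p^{(1)})]A_\infty\|z - z'\|,$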



as required in (4.29). Further, relation (4.30.a) is clear from the construction of $\Xi_k$. To prove (4.30.b), observe that when $(x,y)\in Z$, we have (see (4.28), (4.25))

$\|\Xi_x(x,y,\eta)\|_\infty\le|\bar{y}_\imath|_1\max_{1\le j\le n}|A_j^\imath|_\infty\le A_\infty,$

and, since $F_x(x,y) = \mathbf{E}_\eta\{\Xi_x(x,y,\eta)\}$,

(5.33)    $\|\Xi_x(x,y,\eta) - F_x(x,y)\|_\infty\le 2A_\infty.$

Clearly,

(5.34)    $|\Xi_y(x,y,\eta) - F_y(x,y)|_\infty = \left|A_\jmath - \sum_{j=1}^nx_jA_j\right|_\infty\le 2A_\infty.$

Applying [4, Theorem 2.1(iii), Example 3.2, Lemma 1], we derive from (5.33) and (5.34) that for every $(x,y)\in Z$ and every $i = 1, 2, \dots$ it holds

$\mathbf{E}\{\exp\{\|\Xi_k^x(x,y,\zeta_i) - F_x(x,y)\|_\infty^2/M_{k,x}^2\}\}\le\exp\{1\},\qquad M_{k,x} = 2A_\infty\left(2\exp\{1/2\}\sqrt{\ln(n)} + 3\right)k^{-1/2},$

and

$\mathbf{E}\{\exp\{|\Xi_k^y(x,y,\zeta_i) - F_y(x,y)|_\infty^2/M_{k,y}^2\}\}\le\exp\{1\},\qquad M_{k,y} = 2A_\infty\left(2\exp\{1/2\}\sqrt{\ln(p^{(1)})} + 3\right)k^{-1/2}.$

Combining the latter bounds with (5.32) we conclude (4.30.b). □



